INVESTIGATIONS OF THE SIMPLEX* 


LLOYD G. HUMPHREYS† 


UNIVERSITY OF ILLINOIS 


I am honored to be able to speak on the 25th anniversary of the founding 
of the Psychometric Society. It would be appropriate for your president on 
this occasion to review the 25-year history of our society and of mathematical 
methods in psychology. Fortunately our committee headed by Charles 
Wrigley has taken care of this need, and speakers better qualified than I will 
speak to these issues. I shall turn to a less demanding problem. 

In limiting my topic I was faced with certain constraints arising from 
my training, background, and interests. One constraint arises from a misspent 
youth during which too many years were spent studying disciplines other 
than mathematics. A second constraint, which probably stems from my 
training in experimental psychology, is a tendency to be more interested 
in data than theory. Finally, there is the Air Force personnel research ex- 
perience which developed in me an abiding interest in applied prediction 
problems. If you consider this combination to be an odd one for the Presi- 
dent of the Psychometric Society, don’t blame me. You elected me! 

Although I shall introduce my topic by discussing the weather, I am 
not trying to avoid coming to grips with a problem. My problem does have 
something to do with the weather, and it also has something to do with 
Guttman’s simplex [5]. My discussion will be a little bit mathematical and 
quite a bit empirical. It also has applied implications. 


Prediction in Meteorology and in Psychology 


In my experience, measurement psychologists talk frequently about 
weather prediction problems. Clearly there is a sense of kinship involved. 
Very frequently this talk takes the line that the respective accuracy of 
psychological and meteorological predictions is not very different, i.e., both 
are highly fallible. The psychologist, as a matter of fact, may actually exhibit 
some feelings of superiority because, it is claimed, the weather man is not 
able to do much if any better than predict for tomorrow what happened 
today. By implication, if not by forthright statement, psychological 
predictions involving factor-pure tests, multiple regression equations, criterion 
analysis, and the like are considered to be much more sophisticated. This 
I doubt. 

*Presidential address delivered to the Psychometric Society, Chicago, Illinois, 
September 6, 1960. 

†The research on which this paper is based was supported in part by the University 
Research Board of the University of Illinois and, in larger part, by the National Institute 
of Mental Health. 

It takes only a quick look at our ability tests and proficiency measures, 
after we strip away the blinders our verbiage has created, to conclude that we 
are also predicting tomorrow’s performance from today’s performance. 
Whether an ability test be labelled aptitude, intelligence, or achievement it 
samples current performance. Whether the proficiency measure is a pass-fail 
dichotomy, a graphic rating, an output measure, or another test, it is also 
a sample of performance or a reflection of total performance. Stick and rudder 
coordination before training predicts pass-fail in subsequent pilot training; 
intermediate mathematics of the CEEB at the end of high school predicts 
mathematics grades in the freshman college year; similarly the English test 
predicts grades in English composition, etc. The basis of our claim for so- 
phistication going beyond the prediction of tomorrow’s weather from knowl- 
edge of what happened today lies largely in analysis of the variables to be 
measured, the construction of devices to measure these variables, and the 
methods used in relating today’s variables to tomorrow’s. We are actually 
far below the level of sophistication of the meteorologist in predicting change. 
We have nothing comparable to his pressure systems, temperature gradients, 
etc. Furthermore, we stand in as much need of dynamic principles as he 
does, since the prediction of tomorrow’s status from knowledge of today’s 
behavior is neither very accurate nor very elegant. 

It is possible to relate this point of view to the distinction between 
S-R and R-R laws for which Spence [6] is responsible. Prediction of the 
weather from the dynamics of movements of air masses is akin to Spence’s 
S-R laws; prediction of tomorrow’s weather from today’s is the meteorological 
equivalent of the use of R-R laws. 

Because I like Spence’s distinction it does not necessarily follow that I 
would recommend his version of S-R laws as an appropriate model. I do 
conclude that present R-R laws are of limited accuracy: in order to improve 
we must concern ourselves with the dynamics of change. The rest of this 
paper will be concerned with the documentation for this conclusion. It is 
thus more a dissection of the problem than a description of the cure. It 
should not be called a post-mortem examination, however, since the body of 
psychological prediction is still very much alive. Difficulties with R-R laws 
are not sufficient grounds for discarding them since in the prediction of an 
individual’s performance the alternative is still the crystal ball. 


A Statistical Basis For Change 


My starting place is the simple-minded assumption that living organisms 
show constant change. We label the major kinds of change as growth and 
decay, learning and forgetting, warm up and fatigue. I also make the 
assumption that change can be conceptualized in terms of a succession of increments 
between measurement occasions. In a given sample the various increments 
are described in terms of their means, variances, and covariances. The 
measurement occasions are trials, phases, time periods, etc. 

Let the scores of successive measurements be represented by $x_1$, 
$x_2$, $\ldots$, $x_n$. Let these scores in turn be considered the sum of a series of 
increments, $d$, as follows: 


$$x_1 = d_1, \quad x_2 = d_1 + d_2, \quad \ldots, \quad x_n = \sum_{i=1}^{n} d_i.$$

The convention that the $l$th occasion occurs before the $m$th and the latter 
before the $n$th will also be adopted. We shall also be concerned with true 
scores in the beginning, deferring temporarily the problems raised by meas- 
urement error. 

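From the formula for the correlation of sums, the correlation between 
the measures $l$ and $m$ is written as follows: 


$$r_{x_l x_m} = \frac{\sigma_{x_l}}{\sigma_{x_m}} + \frac{\sum_{i=1}^{l} \sum_{j=l+1}^{m} r_{d_i d_j} \sigma_{d_i} \sigma_{d_j}}{\sigma_{x_l} \sigma_{x_m}} \tag{1}$$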

From (1) it is apparent that, unless the correlations between increments 
are unity, the correlation between any two measurement occasions will 
be less than unity and that the further apart the measures are in the series 
the lower will be the correlation between them. These properties coincide 
with the descriptive characteristics of the simplex. 

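Since I did not assume independent increments, it might be supposed 
that the more analytic definition of the simplex, of which one aspect is that 
$r_{x_l x_n \cdot x_m} = .00$, would not hold. In order to check on this supposition we need 
the two additional correlations. 


$$r_{x_m x_n} = \frac{\sigma_{x_m}}{\sigma_{x_n}} + \frac{\sum_{i=1}^{m} \sum_{j=m+1}^{n} r_{d_i d_j} \sigma_{d_i} \sigma_{d_j}}{\sigma_{x_m} \sigma_{x_n}} \tag{2}$$

$$r_{x_l x_n} = \frac{\sigma_{x_l}}{\sigma_{x_n}} + \frac{\sum_{i=1}^{l} \sum_{j=l+1}^{n} r_{d_i d_j} \sigma_{d_i} \sigma_{d_j}}{\sigma_{x_l} \sigma_{x_n}} \tag{3}$$
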
If the specified partial correlation is to be zero, then $r_{x_l x_n}$ must equal the 
product of $r_{x_l x_m}$ and $r_{x_m x_n}$. It is easy to see that the equality holds if we 
assume that all of the covariances sum to zero. This is of course equivalent 
to Guttman’s original development of the simplex. It can also be shown, 
however, that the equality will still hold if mean variances and covariances 
computed from different parts of the matrix are equal. This is less restrictive 
than Guttman’s assumption and I suspect is at least approximately true for 
sizable blocks of data involving successive measures obtained during growth 
and learning. 
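
The simplest case of this argument can be checked numerically. The sketch below is an illustration only: it assumes strictly independent increments, so that the covariance term in (1) vanishes, and the increment variances and function names are hypothetical choices rather than anything taken from the data discussed here.

```python
import math

def simplex_from_increment_variances(variances):
    """Correlations among cumulative sums of independent increments.

    With independent increments, cov(x_l, x_m) = var(x_l) for l <= m,
    so r(x_l, x_m) = sd(x_l) / sd(x_m): formula (1) with a zero
    covariance term.
    """
    cumulative, total = [], 0.0
    for v in variances:
        total += v
        cumulative.append(total)
    k = len(cumulative)
    return [[math.sqrt(min(cumulative[i], cumulative[j]) /
                       max(cumulative[i], cumulative[j]))
             for j in range(k)] for i in range(k)]

def partial_corr(r, l, n, m):
    """Partial correlation between variables l and n with m held constant."""
    numerator = r[l][n] - r[l][m] * r[m][n]
    denominator = math.sqrt((1 - r[l][m] ** 2) * (1 - r[m][n] ** 2))
    return numerator / denominator

# Hypothetical increment variances, chosen only for illustration.
increment_variances = [4.0, 1.0, 2.0, 0.5, 1.5]
R = simplex_from_increment_variances(increment_variances)

# Correlations with the first occasion fall off along the series.
first_row = [round(value, 3) for value in R[0]]

# Holding an intermediate occasion constant removes the remote correlation.
partial_first_last = partial_corr(R, 0, 4, 2)
```

Under these assumptions `first_row` decreases steadily from unity, and `partial_first_last` differs from zero only by rounding error.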

Probably of greater importance than the goodness of the above assump- 
tion in obtaining fits to empirical matrices is the contamination introduced 
by measurement error. Partialling out a fallible intermediate variable in a 
simplex will not reduce the correlation between the remote variables to zero. 
Use of correlations corrected for attenuation comes to mind, but unbiased 
reliability estimates are presently impossible to obtain during periods of 
rapid change. For most learning data, for example, we must be content to 
look for simplex matrices at the descriptive level, but once we start looking 
large numbers of examples should be found. This expectation is based on 
nothing more profound than the assumption that perfect correlations are 
inherently improbable. 
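
A small sketch of the contamination described above, assuming the classical attenuation relation in which an observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The three-variable matrix and the reliability of .90 are hypothetical values used only for illustration.

```python
import math

def attenuate(true_r, reliabilities):
    """Observed correlations implied by true correlations and reliabilities."""
    k = len(true_r)
    return [[1.0 if i == j else
             true_r[i][j] * math.sqrt(reliabilities[i] * reliabilities[j])
             for j in range(k)] for i in range(k)]

def partial_corr(r, l, n, m):
    """Partial correlation between variables l and n with m held constant."""
    numerator = r[l][n] - r[l][m] * r[m][n]
    denominator = math.sqrt((1 - r[l][m] ** 2) * (1 - r[m][n] ** 2))
    return numerator / denominator

# A perfect simplex among true scores: r_13 equals r_12 times r_23.
true_r = [[1.00, 0.80, 0.64],
          [0.80, 1.00, 0.80],
          [0.64, 0.80, 1.00]]

# Hypothetical reliability of .90 for every measure.
observed_r = attenuate(true_r, [0.90, 0.90, 0.90])

partial_true = partial_corr(true_r, 0, 2, 1)
partial_observed = partial_corr(observed_r, 0, 2, 1)
```

In this example the partial among true scores vanishes, while the partial among the fallible measures stays clearly positive, because the fallible intermediate variable removes only part of what the remote variables share.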


Examples of the Simplex Pattern 


Two of the best examples of matrices that follow the simplex pattern 
have been published by Fleishman and Hempel [3, 4]. Both involved com- 
puting the intercorrelations of eight trials during learning for each of two 
motor skills. I shall only summarize their data. For Complex Coordination, 
the stick and rudder coordination task, the mean correlation between ad- 
jacent trials was .86 while the correlation between the first and eighth trials 
was .59. For Discrimination Reaction Time the corresponding values were 
.78 and .56. There are very few reversals of the simplex form in either matrix. 

Anderson’s summary of the stability of intelligence test performance [1] 
shows clearly that these data also assume the simplex form. Although com- 
plete intercorrelation matrices are not presented, the trends of the corre- 
lations between initial and succeeding measures, and between final and 
preceding measures are sufficiently convincing. Correlations between initial 
and final measures are lower at all ages than correlations between adjacent 
ages. 

Anderson also anticipated Guttman in a sense. He concluded that the 
correlational data were consistent with the hypothesis that the increments 
of growth were unrelated to the base to which they were added. Anderson 
also reports correlations involving height and weight. Again there is no doubt 
but that these assume the simplex form. Prediction of tomorrow’s height 
from today’s is quite accurate in general, but if the time interval is long or 
if the initial measure is obtained early in life there would be many errors 
in predicting terminal status. 

In predicting adult height from height at six we would not choose to 
confuse the issue by calling the latter “aptitude for growing.” There are 
also many errors in predicting undergraduate college success from proficiency 
with words in various forms at the age of six, but here we compound our 
difficulties by calling measures of the latter sort “verbal aptitude.” 
Joseph Lewin has obtained correlational data from an experimental 
course in shorthand that seem to be of the simplex form.* Subjects partici- 
pated in eight two-hour learning sessions and had their proficiency measured 
at the end of each session by means of separately timed parallel forms, thus 
providing a reliability estimate for each “trial.” These results appear in 
Table 1. Raw correlations appear above the diagonal while correlations 
corrected for attenuation appear below the diagonal. 


Table 1 
Intercorrelations of Short-hand Sessions 


(N = 28) 











1 39: -60 41 Mt SEO 2S 35 
2 43 16-80 76 70 2. O65 
3 G5. > 82 85 78 79 77 76 
4 44 36° - 92 91 90 6&5 £8 
5 SY. 6r °'86) - 97 92 91 93 
6 So PR Be GS 97 91 91 
7 27 76 . 82 90° 36 96 & 





The N is only 28, but with the exception of the initial variable the 
simplex form appears. The low correlations of the first proficiency test with 
everything else, including the adjacent session, are explained by an error 
in setting the time limits for this pair of parallel forms. The time limits were 
too generous—what were intended as speeded tests became power tests. 

One measure of goodness of fit is the size of the discrepancy between 
$r_{x_l x_n}$ and the product of $r_{x_l x_m}$ and $r_{x_m x_n}$. Thirty-five such discrepancies were 
calculated for the last seven variables. The algebraic mean was −.009 and 
the absolute mean discrepancy was .045. Intuition tells me that this is a 
good fit for these data. 
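
The count of thirty-five follows from taking every ordered triple $l < m < n$ among seven variables. The sketch below shows one way such a summary could be computed. It is an illustration under assumptions: the sign convention (remote correlation minus product) is a guess, the function names are hypothetical, and the matrix is an artificial perfect simplex rather than the shorthand data.

```python
from itertools import combinations

def simplex_discrepancies(r):
    """r_ln minus r_lm * r_mn for every triple l < m < n."""
    k = len(r)
    return [r[l][n] - r[l][m] * r[m][n]
            for l, m, n in combinations(range(k), 3)]

def summarize(discrepancies):
    """Number of discrepancies, their algebraic mean, and their absolute mean."""
    count = len(discrepancies)
    algebraic_mean = sum(discrepancies) / count
    absolute_mean = sum(abs(d) for d in discrepancies) / count
    return count, algebraic_mean, absolute_mean

# Artificial seven-variable simplex with adjacent correlations of .90.
adjacent = 0.90
R = [[adjacent ** abs(i - j) for j in range(7)] for i in range(7)]

count, algebraic_mean, absolute_mean = summarize(simplex_discrepancies(R))
```

For a perfect simplex both means are zero apart from rounding, so the size of the observed means gives a rough descriptive index of departure from the form.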

As a final example, though a relatively uncontrolled one, I shall present 
the intercorrelations of grade-point averages for each of the eight semesters 
of work in electrical engineering for a sample of 91 subjects.† These students 
did not take precisely the same courses, nor did they always take identical 
courses in the same order. I do claim that the curriculum was moderately 
standardized. Also only those students who made normal progress were 
retained. These intercorrelations appear in Table 2. Although an internal 
consistency approach to reliability estimates would have been possible, these 
were not obtained. Judging in terms of correlations between semesters, 
particularly between adjacent semesters, the reliability of the fifth semester 
grade average is probably lower than the rest, but even without taking 
reliability into account the matrix is close to the simplex form. 

*Unpublished study, University of Illinois, 1960. 
†These data were gathered under the supervision of Mr. Emmanuel Lask and 
analyzed by Mr. Aart Hazewinkel. 


Table 2 
Intercorrelations of Semester Grades in Electrical Engineering 


(N = 91) 


          1     2     3     4     5     6     7     8 
    1          .69   .55   .46   .45   .41   .34   .33 
    2                .65   .58   .50   .69   .41   .44 
    3                      .65   .59   .60   .56   .52 
    4                            .62   .63   .66   .53 
    5                                  .61   .64   .68 
    6                                        .65   .63 
    7                                              .72 





All of the matrices presented clearly show the problems involved in 
predicting for “tomorrow” what happened today. The more remote in time 
(or process) “tomorrow” is, the less accurate is the prediction. In selection 
research one should not be satisfied with validation of predictors against 
the earliest possible criteria. Secondly, if change is rapid, the use of work- 
sample selection tests is highly suspect. Improvements in prediction of 
remote criteria may come from the measurement of many more facets of 
the individual than we are currently assessing or from the discovery of new 
methods of combining our variables, but I am convinced that in many areas 
we must start taking into account the stimulus situation with which the 
person will be faced between the initial and final measurements. 


Reliability Estimation for Variables in a Simplex 


The need for reliabilities for these and other data is acute. This need 
led me to an interesting method for reliability estimation, though it is a bit 
too incestuous to use in determining whether a set of correlations is a true 
simplex. The method turns the problem around, assumes a simplex, and 
solves for the reliabilities. These are given by 
Table 3 


Reliability Estimates for Motor Skills Learning 








Complex Coordination Trials Discrimination Reaction Time Trials 
(N = 197) (N = 264) 
2 3 4 5 6 7 2 3 4 5 6 7 





1,00 - 1. 


-85 - 
80 - 
5 « 
-70 - 


Means 


04 2 1 L 1 


99: 2 1 1 1 l 
94 21 ee | 8 1 3 3 2 
89 1 “ 2 6 2 os 7 6 3 L 
84 1 2 2 4 8 4 3 
79 1 L L 
74 1 2 
«96. ..91 .93 95 185 .92 -86 .86 .86 .83 .79 .77 





$$r_{x_m x_m} = \frac{r_{x_l x_m} \, r_{x_m x_n}}{r_{x_l x_n}} \tag{4}$$


Since equations for the expected partials of zero from which (4) was derived 
cannot be written for the first and last measures in a series, this method 
leaves us with two missing reliabilities. It does provide multiple, though 
dependent, estimates of the others in all matrices larger than three-by-three. 
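
A sketch of how the multiple estimates arise, assuming that each observed correlation is the true simplex correlation attenuated by the square roots of the two reliabilities, so that the reliabilities of the outer measures cancel in the ratio. The function name, the adjacent true correlation of .90, and the reliabilities are hypothetical; variables are indexed from zero in the code.

```python
import math

def reliability_estimates(r):
    """Every estimate r_lm * r_mn / r_ln for each interior variable m.

    The first and last variables have no estimates, since no pair (l, n)
    brackets them.
    """
    k = len(r)
    estimates = {}
    for m in range(1, k - 1):
        estimates[m] = [r[l][m] * r[m][n] / r[l][n]
                        for l in range(m)
                        for n in range(m + 1, k)]
    return estimates

# Hypothetical reliabilities used to build an attenuated simplex.
true_reliabilities = [0.95, 0.90, 0.85, 0.80, 0.90]
k = len(true_reliabilities)
R = [[1.0 if i == j else
      0.90 ** abs(i - j) * math.sqrt(true_reliabilities[i] * true_reliabilities[j])
      for j in range(k)] for i in range(k)]

estimates = reliability_estimates(R)
means = {m: sum(values) / len(values) for m, values in estimates.items()}
```

With error-free construction every estimate for a given variable recovers the reliability that was built in. With real data the estimates scatter, and it is their distribution that is of interest.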

Although a side issue in the present context, this approach allows one to 


Table 4 


Reliability Estimates for Semester Grade Averages 














Electrical Engineering Liberal Arts and Sciences 
(N = 91) (N = 314) 
2 3 4 5 6 7 2 3 4 5 6 7 
1,00 - 1.04 1 1 
95 - .99 
-90 - .94 L 1 1 1 1 1 
-85 - .89 2 2 1 1 1 
-80 - .84 2 l 1 2 i 
ee? | 1 3 1 3 1 
-70 - .74 2 + 1 2 1 2 3 
-65 + .79 1 3 2 2 2 
-60 - .64 2 3 1 3 2 2 2 L 4 
35 - 359 3 2 4 2 2 1 
-50- .54 1 2 3 5 3 1 
+45 = .49 2 1 
40 - 44 1 3 
Means 02 90 TO A: .. 8 OB: a2 2°. 
estimate reliabilities of individual items in unifactor scales. If a sample of 
items were perfectly reliable and were loaded on only one factor, the product 
moment correlations among items differing in difficulty or popularity would 
form a simplex. Thus, assuming that departures from a perfect simplex are 
the result of measurement errors, reliabilities of all but the easiest and most 
difficult items can be estimated as above. Such estimates should be somewhat 
higher than item reliabilities estimated by the internal consistency method; 
basically, present estimates are of parallel-form reliabilities. 

It is of interest to see how sensible the results of these computations are. 
Tables 3 and 4 contain distributions of reliability estimates from several 
sources of data. Table 5 presents additional distributions plus a comparison 
with parallel forms estimates. 


Table 5 
Reliability Estimates for Experimental Short-hand Course 


(N = 28) 








3 4 5 6 7 





1.05 - 1.09 1 1 

1,00 - 1.04 1 

95 - .99 5 5 2 1 
90 - .94 1 3 6 

-85 - .89 2 4 
-80 - .84 2 


75 - .79 1 
Means a eee eo) OF «SOR ASD 


Parallel forms estimates .92 .93 .94 .95 .96 





My reaction to these data, intuitive again, is that the basic assumption 
seems fairly reasonable for the two motor skills and for the shorthand but 
not very realistic for the academic grades. For the latter the variance of the 
estimates is too great even though the mean values are not out of line with 
common-sense expectations. 


Factor Descriptions of the Simplex Form 


The preceding observations suggest that the intercorrelations of grade 
averages do not form a true simplex. The students may be gradually chang- 
ing in accordance with the simplex assumptions, but there is some confound- 
ing factor or factors in addition. Presumably this is due to the changing 
course content from one semester to the next. As a matter of fact the subjects 
would not need to change at all if changes in course content were sufficiently 
systematic. A gradual shift in emphasis from verbal to quantitative materials, 
for example, would produce a matrix resembling a simplex. Thus the factor 
loadings in Table 6 form a matrix of intercorrelations which fit the descriptive 
definition of the simplex, but the partial correlations, $r_{x_l x_n \cdot x_m}$, are not zero. 


TABLE 6 


A Two-Factor Structure for a Pseudo Simplex 


               1      2      3      4      5      6      7      8      9 
Factor I     √.90   √.80   √.70   √.60   √.50   √.40   √.30   √.20   √.10 
Factor II    √.10   √.20   √.30   √.40   √.50   √.60   √.70   √.80   √.90 

h²           1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00 
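
The point can be illustrated numerically. The sketch below assumes that the squared loadings in Table 6 run from .90 down to .10 on Factor I and in the reverse order on Factor II, which is a reading of the table rather than a certainty. The variable names are hypothetical.

```python
import math

# Assumed loadings: squared loadings .90, .80, ..., .10 on Factor I,
# and the complement on Factor II, so every communality is 1.00.
factor_1 = [math.sqrt(tenths / 10) for tenths in range(9, 0, -1)]
factor_2 = [math.sqrt(tenths / 10) for tenths in range(1, 10)]

# Correlations reproduced from two orthogonal factors.
R = [[factor_1[i] * factor_1[j] + factor_2[i] * factor_2[j]
      for j in range(9)] for i in range(9)]

def partial_corr(r, l, n, m):
    """Partial correlation between variables l and n with m held constant."""
    numerator = r[l][n] - r[l][m] * r[m][n]
    denominator = math.sqrt((1 - r[l][m] ** 2) * (1 - r[m][n] ** 2))
    return numerator / denominator

first_row = [round(value, 3) for value in R[0]]

# Variables 1 and 3 with variable 2 held constant.
partial_1_3_given_2 = partial_corr(R, 0, 2, 1)
```

Under this reading the first row of correlations falls steadily as the variables become more remote, which is the descriptive simplex pattern. The partial correlation, however, is far from zero: with unit communalities on only two factors, the residuals left after removing any one variable lie on a single line, so the partial is at its extreme value of −1 instead of zero.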

Two factors represent the minimum number required to produce a 
matrix resembling a simplex though other illustrations involving more than 
two and less than n factors could be invented. This suggests a possible merg- 
ing of common factor theory and simplex theory, particularly since DuBois 
[2] has shown that the communality and minimum rank notions can be 
applied to simplex matrices. There is still, however, a major gap. DuBois’ 
minimum rank communalities alternate between values less than one and 
unity depending only on the rank order of the variables. It can also be shown 
that these minimum rank communalities for the odd-numbered variables 
are the squared multiple correlations between each such variable and the 
rest in the matrix. (I laboriously developed a tedious proof of this theorem 
for up to seven variables. Dr. Henry Kaiser easily developed an elegant 
general proof for which I am most grateful.) It makes no sense psychologically 
for Guttman’s lower and upper bounds for communalities to be represented 
in the same matrix and to alternate between odd- and even-numbered 
variables. I conclude that the minimum rank model is not psychologically 
sound for the true simplex. 

Such matrices should not be factored in this fashion whether they stand 
alone or are included in a larger matrix. The data of Fleishman and Hempel, 
for example, should be reanalyzed. The alternative to the minimum rank 
model is the use of the diagonal method with something other than com- 
munalities in the principal diagonal. Guttman originally suggested the use 
of unities, but he was dealing with infallible data. For fallible data logic 
dictates the use of reliabilities, but again extracting as many factors as 
variables. In mixed matrices the simplex should be analyzed separately 
from the remaining variables, and vice-versa. 

To distinguish between true and pseudo simplices in fallible data may be 
very difficult. One possibility is to obtain reliability estimates in some inde- 
pendent manner, correct for attenuation, and check the properties of the 
resulting matrix. A second interesting possibility arises from DuBois’ finding. 
If one applies the minimum rank model and iterates communalities, the 
pattern of the communalities will be revealing. If half as many factors as 
variables have been extracted, and if the communalities alternate between 
squared multiples and reliabilities, the matrix is a simplex and the factor 
model is inappropriate. Use of the pattern of communalities, however, is 
not restricted to situations in which the number of factors is n/2. For ex- 
ample, in a nine-variable simplex, iteration of communalities from two 
factors will result in the building up of the communalities of variables three 
and seven; use of three factors will lead to high communalities for variables 
two, five, and eight; with four factors, as DuBois has indicated, variables 
two, four, six, and eight develop the high values. 

There is also a very interesting empirical possibility that will distinguish 
between whether the individual has changed or whether the factorial compo- 
sition of the stimulus situation has changed. Assume that a typical “aptitude’’ 
test battery has been administered at the beginning of the freshman year 
and again at the beginning of the senior year. Grade-point averages year 
by year will form a matrix having the descriptive properties of the simplex. 
A critical finding for the hypothesis that the people have changed would be 
the obtaining of similar correlational patterns between freshman tests and 
freshman grades and senior tests and senior grades. 

The above possibility suggests that in factoring developmental data in 
the usual way a highly incongruous outcome would frequently be obtained; 
i.e., well-defined and well-matched factors from separate matrices in which 
time was held constant could not be identified as the same factors in a larger 
matrix in which time varied. The verbal factor at age 6, for example, would 
not be the same as the verbal factor at 12 in a common factor analysis of 
developmental data although the definition of the factor would be unam- 
biguous in the age groups taken separately. 


Summary 


In summary I would like to list a few generalizations that may be worth 
carrying with you when you leave. 

1. Correlational matrices having the simplex form should be found 
very commonly in maturation and learning data. Several diverse examples 
have been presented. 

2. The finding of such a matrix imposes a problem in prediction that is 
currently unsolved. It is possible that prediction of remote performance can 
utilize information concerning the conditions under which growth or learning 
will take place. 

3. Reliability estimates are of critical importance in the analysis of 
simplex matrices, but are frequently difficult to obtain. If one is sufficiently 
confident that the variables do form a simplex, a reliability estimate can be 
obtained from the intercorrelations of the variables. 

4. The minimum-rank factor model should not be applied to simplex 
matrices. Communalities alternate between the squared multiple correlation 
and unity as a function only of the rank order of the variables. Psychologically 
this is nonsense. 

One last generalization is a matter of faith. Both tomorrow’s weather 
and tomorrow’s behavior are predictable. 
USE OF TRUE-SCORE THEORY TO PREDICT MOMENTS OF 
UNIVARIATE AND BIVARIATE 
OBSERVED-SCORE DISTRIBUTIONS* 


FREDERIC M. LORD 


EDUCATIONAL TESTING SERVICE 


Formulas are derived for using the available item statistics and score 
statistics on a test to estimate the moments of the score distribution of a 
lengthened (or shortened) form of the same test. Other formulas are derived 
for estimating the bivariate moments of the scatterplot between two parallel 
test forms using only the data available on either form alone. An empirical 
study is made showing in each case satisfactory agreement between the theo- 
retical values predicted from the formulas and the values actually observed. 
These results suggest the utility of the true-score model used in deriving 
the formulas. 


The set of scores obtained for a group of examinees on a given test 
represents not only the characteristics of the examinees tested, but also the 
peculiarities of the measuring instrument. Clearly, any adequate mental- 
test theory must be capable of separating the one from the other. A major 
step towards this separation is achieved when it is possible to replace the 
frequency distribution of the examinees’ observed scores by an estimated 
distribution of their true scores. 

Different methods for doing this have been summarized in [6]. Most 
methods require estimation of the moments of the true-score distribution 
and the fitting of a frequency curve to these moments. Primary consideration 
here will be given to the item-sampling method, previously called the 
matrix-sampling method ([5], pp. 3-17), which requires only data from a single form 
of the test. Consideration also will be given to the model ([5], pp. 2-3) that 
assumes the errors of measurement to be distributed normally and inde- 
pendently of true score. 

On its face, the present paper is concerned with obtaining the answers 
to two practical problems. 


1. How will the shape (skewness, kurtosis, etc.) of the 
frequency distribution of observed test scores be affected by 
lengthening or shortening the test? 

2. How can the shape of the scatterplot representing the 
relation between observed scores on two parallel forms of a test 
be predicted from the data on a single form alone? 


*This work was supported by contract Nonr-2752(00) between the Office of Naval 
Research and Educational Testing Service. Reproduction in whole or in part for any 
purpose of the United States Government is permitted. 
Although neither of these practical problems involves true scores explicitly, 
it is worthy of note that the theory about true scores provides immediate 
answers to each. Furthermore, these answers in turn provide an empirical 
check on the theory, since the theoretical answers can be compared with 
observed data on pairs of actual tests. (Some such device is often necessary 
for the operational verification of a theory about true scores, since the true 
scores themselves can never be observed directly.) 

The plan of the present paper is thus to use the item-sampling approach 
to true scores to derive formulas for answering the two practical problems 
stated previously. In the first problem, the moments of the observed scores 
on the shortened or lengthened test will be estimated from data on the original 
form of the test; in the second problem, the bivariate moments of the scatter- 
plot of observed scores will be estimated from data on a single test form. Both 
methods will be applied to actual test data so as to compare theoretical and 
actual results. These results also will be briefly compared with the theoretical 
results obtained under the assumption that the errors of measurement are 
distributed normally and independently of true score. 

The formulas used for estimating univariate moments are essentially 
the same as those in [6]. They are presented here in more systematic fashion 
as a special case of the bipolykays, which had been developed and named by 
Hooke [2] after the stimulus of an early draft of [4]. The generalized sym- 
metric means used here for estimating bivariate moments are a new but 
obvious generalization of the bipolykays. 


Derivations, Methods, and Formulas 


Estimating the Moments of Scores on a Lengthened or Shortened Test 


The test score used here will be the proportion rather than the number 
of items answered correctly. Each test is thought of as (at least effectively 
the same as) a sample of test items drawn at random from the same, very 
large pool of items. The true score of an examinee is here defined as the 
probability that an item drawn at random from the pool will be one that he 
answers correctly. Thus, two randomly parallel tests will, by definition, have 
identical true scores, regardless of the number of items in each. 

How can data on an n-item test be used to predict the shape of the 
frequency distribution of observed scores on a randomly parallel v-item test? 
The formulas in [5] provide “type-2-unbiased” estimates (to be denoted by the 
letter f with a subscript) of the true-score moments, and also of products of 
such moments; these estimates are functions of the observed test data. (The 
meaning of the qualification “type-2” is that it is the test items, not the 
examinees, that are being sampled.) The procedure to be used here will treat 
each observed f computed from the data on the n-item test as an estimate 
of the corresponding f for the v-item test. Since the moments of the scores 
on the v-item test are simple linear functions of the v-item f’s, the desired 
moments of the v-item test can be estimated by this approach. 
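
The second raw moment gives the simplest instance of this procedure. The sketch below is an illustration derived from the definitions above, not the general formulas in terms of the f’s: it assumes 0/1 item scores and items sampled at random from a very large pool, so that for an examinee with true score $\zeta$ the expected squared proportion-correct score on $v$ items is $\zeta/v + (1 - 1/v)\zeta^2$. The function name and the small response matrix are hypothetical.

```python
def predict_second_raw_moment(responses, v):
    """Estimate the second raw moment of proportion-correct scores on a
    randomly parallel v-item test from 0/1 responses to an n-item test.

    responses[a][g] is the score of examinee a on item g.
    """
    examinees = len(responses)
    n = len(responses[0])
    number_right = [sum(row) for row in responses]

    # Mean proportion-correct score: estimates the first true-score moment.
    mean_score = sum(number_right) / (examinees * n)

    # Products over pairs of different items estimate the second true-score
    # moment; for 0/1 scores the sum over g != h of x_g * x_h is s * (s - 1).
    second_true_moment = (sum(s * (s - 1) for s in number_right)
                          / (examinees * n * (n - 1)))

    return mean_score / v + (1 - 1 / v) * second_true_moment

# Hypothetical responses: five examinees, four items.
responses = [[1, 1, 0, 1],
             [0, 1, 0, 0],
             [1, 1, 1, 1],
             [0, 0, 1, 0],
             [1, 0, 1, 1]]

same_length = predict_second_raw_moment(responses, 4)
doubled_length = predict_second_raw_moment(responses, 8)
observed = sum((sum(row) / 4) ** 2 for row in responses) / len(responses)
```

When $v$ equals $n$ the prediction reproduces the observed second moment exactly, which is a convenient check on the arithmetic. Lengthening the test shrinks the term that reflects errors of measurement.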

The notation used is summarized below. The only random variables 
are $x_{ga}$ and functions of it; $x_{ga}$ is a random variable only when test items are 
being sampled, not otherwise. 


$n$ (or $v$)    is the number of items in the test. 

$N$    is the number of examinees tested. 

$x_{ga}$    is the score of examinee $a$ on item $g$; it is here assumed 
to be always 0 or 1. 

$z_a = \frac{1}{n} \sum_{g=1}^{n} x_{ga}$    is the proportion-correct score of examinee $a$. 

$m_r' = \frac{1}{N} \sum_{a=1}^{N} z_a^r$    is the $r$th raw moment of the proportion-correct 
score in the group of examinees tested. 

$\pi_g = \frac{1}{N} \sum_{a=1}^{N} x_{ga}$    is the difficulty of item $g$, that is, the proportion of examinees 
answering item $g$ correctly. 

$M_r' = \frac{1}{n} \sum_{g=1}^{n} \pi_g^r$    is the $r$th moment of the frequency distribution of 
the item difficulties (note that $M_1' = m_1'$). 

$\pi_{gh} = \frac{1}{N} \sum_{a=1}^{N} x_{ga} x_{ha}$    is the proportion of examinees answering both 
items $g$ and $h$ correctly (note that $\pi_{gg} = \pi_g$). 

$\pi_{ghi} = \frac{1}{N} \sum_{a=1}^{N} x_{ga} x_{ha} x_{ia}$,    etcetera. 

$s_g = \frac{1}{N} \sum_{a=1}^{N} x_{ga} z_a$    is $1/N$ times the sum of the scores for the 
examinees answering item $g$ correctly. 

$n^{(r)} = n(n - 1)(n - 2) \cdots (n - r + 1) = n!/(n - r)!$ 

$\sum_{\neq}$    indicates that the quantities that follow are to be 
summed over all values of each subscript omitting 
those cases where two or more of these values are 
equal; e.g., 

$$\sum_{g=1}^{n} \sum_{h=1}^{n} \pi_{gh} = \sum_{\neq} \pi_{gh} + \sum_{g=1}^{n} \pi_g .$$

$E$    expected value = average value over all possible 
randomly parallel tests. 

$\hat{=}$    is to be read “is estimated by.” 
E-*    is to be read “an unbiased estimate of.” 

$\zeta_a = E z_a$    is the true proportion-correct score of examinee $a$. 

$\mu_r'$    is the $r$th raw moment of the frequency distribution 
of true (proportion-correct) scores in the group of 
examinees tested. 


In the present context, a generalized symmetric mean (g.s.m.), denoted 
by f in Table 1 and defined in the equations given below, is a special case of 
the f’s defined by Hooke [2]. The f2o to fog used here correspond to Hooke’s 
t, to tio , respectively, not to his f2) to fos . The f’s here are a special case 
for two reasons: (i) the values of x;, are here restricted to 0 and 1, with the 
results that f,. and f,s through f.. are of lower order than the others and that 
Hooke’s f,; through fs; are identical with certain lower numbered f’s; (ii) 
the number of examinees (N) is here assumed to be so large that the group of 
examinees can be treated as if it were the entire population. Each of these restrictions 
could be removed, but it will usually be undesirable to do so because 
of the increased complexity of the resulting formulas. Moreover, it is probably 
not useful in most cases to try to estimate any true-score or observed-score 
moments beyond the second unless N is at least 500 or 1,000. 

In practice, the g’s are first computed from the raw test data. Then, with 
the help of Table 1, any f may be readily written as a linear function of the 
g’s. Table 1 provides the numerical coefficients in this linear function*; these 
are obtained by starting at the left of the table and reading across to and 
including the diagonal, but no further. Thus, $n f_{26} = g_{26}$; $n^{[3]} f_{20} = 2g_{26} - 3g_{24} + g_{20}$. 
Similarly, any g may be written as a linear function of the f’s 
by starting at the top of the table and reading down to and including the 
diagonal, all the numerical coefficients being positive. Thus, $g_{24} = n f_{26} + n^{[2]} f_{24}$; 
$g_8 = n f_{26} + 7n^{[2]} f_{24} + 6n^{[3]} f_{20} + n^{[4]} f_8$. 

Equations defining the f’s and equations for computing the g’s are as 
follows. 
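The relations between f’s and g’s quoted above can be checked on a small data set. The following is a minimal sketch, not part of the original paper: the helper names (`falling`, `joint_pi`, `f_symmetric`, `g_moment`) and the 4-item, 5-examinee score matrix are illustrative assumptions. It computes the generalized symmetric means directly as averages over ordered tuples of distinct items, and the g’s as n^r times the raw moments.

```python
from fractions import Fraction
from itertools import permutations


def falling(n, r):
    """Falling factorial n^[r] = n(n-1)...(n-r+1)."""
    out = 1
    for k in range(r):
        out *= n - k
    return out


def joint_pi(x, items):
    """Proportion of examinees answering every item in `items` correctly."""
    N = len(x[0])
    return Fraction(sum(all(x[g][a] for g in items) for a in range(N)), N)


def f_symmetric(x, r):
    """Average of the joint proportion over ordered r-tuples of distinct items."""
    n = len(x)
    total = sum(joint_pi(x, t) for t in permutations(range(n), r))
    return total / falling(n, r)


def g_moment(x, r):
    """n^r times the r-th raw moment of the proportion-correct scores."""
    n, N = len(x), len(x[0])
    return Fraction(sum(sum(x[g][a] for g in range(n)) ** r for a in range(N)), N)


# Hypothetical data: x[g][a] is the 0/1 score of examinee a on item g.
x = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 1, 0],
]
n = len(x)
f26, f24, f20, f8 = (f_symmetric(x, r) for r in (1, 2, 3, 4))
g26, g24, g20, g8 = (g_moment(x, r) for r in (1, 2, 3, 4))

print(n * f26 == g26)
print(falling(n, 3) * f20 == 2 * g26 - 3 * g24 + g20)
print(g24 == n * f26 + falling(n, 2) * f24)
print(g8 == n * f26 + 7 * falling(n, 2) * f24 + 6 * falling(n, 3) * f20 + falling(n, 4) * f8)
```

Exact rational arithmetic is used so that the identities can be compared with `==` rather than with a tolerance.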


n'*'f, = a 11 \em; = nA yrs 1 = n'mi' 
n't, = Demme, =n EO ye yt go = n'mims 
nf, = Do wee; gs = n'MM; 
nf = Dietary =n Bye? gx = n'my 
nlf, = Do. wom, gs = n’M;” 
nfo = Doe mont, = mE lus Jo = n'mims 
n't, = Don wim, gr = n° MMS 


*The writer is indebted to Ruth Bredon, who has checked most of the formulas 
presented in Table 1, and subsequently, by deriving them independently. 
n'*"f, = ke Taig = Eo yl gs = n'mi 

nfo = Dom Jo = nMi 
nfo = pe WoW ork; gio = nm x, Moko 
nf, = 2 TAT ni gn = n’mMi 
ants po + 1,0 )0;, = nA 8 Ju = nmi? 
nhs a pe Tor oi I3 = n? 2 
ie = be TWorT Mh Ju = e bm WoT hT oh 
ae = pe ToriTg Is = n? >= Sige 
n' fie = pe iT oh Jis = 2 Zz Tey 
nf. = Din On 97 = D Dawn 
n' he = Don temas = 0B ulus fis = n°mims 
"h, ae pe Tor, Jig = n° M{M$ 
nf. = . a gan = nimi 

tin = Dow ga = nM; 
n'* foo = pe ToT oh J22 = 1 bt Ty 
n'?"fog = } TT, = ne a? Jos = 2 mi{” 
nf, = b T= n' EO yh Ju = nm 

Mfrs = Zz; ; Jos = nM} 

Nf oe su 7, a nE" ys Joe = nm = nM? 


$\hat{g}$ is an estimated value of g for the v-item test. 

The procedure in dealing with the first of the two test-theory problems 
may now be fully outlined as follows. 

(1) Compute the g’s from the data on the n-item test. 

(2) Compute the f’s for the n-item test. Consider these as estimates 
of the f’s for the v-item test. 

(3) Compute the estimated raw moments and products of raw moments 
($\hat g_1, \hat g_2, \hat g_4, \hat g_6, \hat g_8, \hat g_{12}, \hat g_{18}, \hat g_{20}, \hat g_{23}, \hat g_{24}, \hat g_{26}$) for the v-item test from the 
estimated f’s by treating the latter as if they were actually observed values 
on the v-item test. (I.e., use the coefficients on and above the diagonal in 
Table 1, replacing n by v and g by $\hat g$ throughout the table.) 

(4) Compute estimates of the central moments and of the cumulants 
of the v-item test by substituting the $\hat g$’s into the usual formulas for moments 
and cumulants (ignoring terms of order 1/N). 

TABLE 1 

Conversion Table for f’s and g’s 

[Table entries are not legible in this copy.] 

It is, of course, possible to use the procedure just described to obtain 
formulas expressing the estimated g’s for the v-item test directly in terms of 
the observed g’s on the n-item test. This has been done, and the resulting 
formulas have been summarized in Table 2, which shows, for example, that 
the estimate of $g_{20}$ for the v-item test is 

$$\hat g_{20} = \frac{1}{n^{[3]}}\left[(v - n)(2v - n)\,v\,g_{26} - 3(v - n)\,v^{[2]} g_{24} + v^{[3]} g_{20}\right].$$

It is desirable in practice to use both tables of formulas so that each set of 
computations (and also each set of formulas) can serve as a check on the 
other. 

The use of the formulas may be illustrated by the following example, 
in which n = 60, v = 120 and the problem is to estimate the mean ($m'_1$) 
and variance ($m_2$) of the longer test. The observed values of the g’s for the 
shorter test are 

$$g_{26} = n m'_1 = 49.98, \quad g_{25} = 43.013046, \quad g_{24} = 2517.822, \quad g_{23} = n^2 m_1'^2 = g_{26}^2 = 2498.0004.$$

By Table 1, 

$$f_{26} = g_{26}/n = 0.833, \quad f_{25} = 0.7168841,$$

$$f_{24} = (g_{24} - g_{26})/n^{[2]} = 0.69713051, \quad f_{23} = 0.69349925.$$

Again by Table 1, 

$$\hat g_{26} = v f_{26} = 99.96, \quad \hat g_{24} = v f_{26} + v^{[2]} f_{24} = 10{,}054.9837, \quad \hat g_{23} = 9{,}989.1954.$$

The same values may also be obtained from Table 2. By the definition of the 
$\hat g$’s, the moments about the origin for the v-item test are estimated by 

$$m'_1 \;\hat{=}\; \hat g_{26}/v = 0.833, \quad m'_2 \;\hat{=}\; \hat g_{24}/v^2 = 0.69826276;$$

also 

$$m_1'^2 \;\hat{=}\; \hat g_{23}/v^2 = 0.69369412.$$

Thus the estimated mean of proportion-correct scores on the longer test is 
the same as that on the shorter, and the estimated variance is 

$$m_2 = m'_2 - m_1'^2 = .004568.$$
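The arithmetic of this example can be retraced in a few lines. The sketch below is illustrative only; the variable names are assumptions, and it takes $f_{23} = (g_{23} - g_{25})/n^{[2]}$ and $\hat g_{23} = v f_{25} + v^{[2]} f_{23}$, by analogy with the $f_{24}$ and $\hat g_{24}$ formulas shown, which reproduces the printed value of $\hat g_{23}$.

```python
# Observed g's for the 60-item test, as given in the example.
n, v = 60, 120
g26 = 49.98          # n * m1'
g25 = 43.013046      # n * M2'
g24 = 2517.822       # n^2 * m2'
g23 = g26 ** 2       # n^2 * m1'^2

n2 = n * (n - 1)     # n^[2]
v2 = v * (v - 1)     # v^[2]

# Step 2: f's from the g's of the shorter test.
f26 = g26 / n
f25 = g25 / n
f24 = (g24 - g26) / n2
f23 = (g23 - g25) / n2

# Step 3: estimated g's for the v-item test.
g26_hat = v * f26
g24_hat = v * f26 + v2 * f24
g23_hat = v * f25 + v2 * f23

# Step 4: estimated moments of the lengthened test.
mean_hat = g26_hat / v
m2_raw_hat = g24_hat / v ** 2
mean_sq_hat = g23_hat / v ** 2
var_hat = m2_raw_hat - mean_sq_hat

print(round(f24, 8), round(f23, 8))
print(round(g24_hat, 4), round(g23_hat, 4))
print(round(var_hat, 6))
```

Note that the squared mean is estimated from $\hat g_{23}$ rather than by squaring the estimated mean, which is why the variance differs slightly from $0.69826276 - 0.833^2$.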


This estimated variance for the longer test is presumably slightly larger 
than would have been obtained under the conventional assumption that the 
longer test is rigorously parallel to the shorter test, rather than randomly 
parallel, as assumed here. The exact value that would be obtained under the 
conventional assumption depends on the test reliability, which cannot be 
determined exactly (under the conventional assumption) from data on only 
a single form of the test. 


Estimating the Bivariate Moments of the Scatterplot Between Parallel Test Forms 


The procedure for estimating the bivariate moments for two parallel 
test forms is analogous to that of the immediately preceding section. Certain 
symmetric functions of the bivariate data are unbiased estimators for the 
same true-score moments and moment-products that are estimated by the 
f’s computed from a single test form. The procedure here is therefore to 
consider the f from a single test form as an estimator for the corresponding 
bivariate symmetric function. Since the bivariate moments are linear func- 
tions of these bivariate symmetric functions, these moments may thus be 
estimated by a linear function of the univariate f’s. 

For example, the expected value of the average cross product ($m'_{11}$) between 
proportion-correct scores ($z_a$ and $Z_a$) on two parallel test forms is the second 
moment of the true scores, as is, of course, well known in standard test 
theory. For randomly parallel tests, this result is derived as follows. 

$$E_g m'_{11} = E_g \frac{1}{N}\sum_a z_a Z_a = \frac{1}{N}\sum_a E_g z_a Z_a = \frac{1}{N}\sum_a (E_g z_a)(E_g Z_a) = \frac{1}{N}\sum_a \zeta_a^2 = \mu'_2 .$$

Now, the formulas given earlier in the present paper state that $f_{24}$, computed 
from a single test form, is also an unbiased estimator for $\mu'_2$. Thus $f_{24} = 
(n m'_2 - m'_1)/(n - 1)$ will here be taken as the appropriate estimator, obtainable 
from a single test form, for the bivariate moment $m'_{11}$. This is all rather 
similar to the standard procedure for estimating the correlation between two 
“rationally equivalent” test forms by means of a Kuder-Richardson reliability 
coefficient computed from the data on just a single form. 
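As a small illustration of this estimator, the sketch below evaluates $f_{24} = (n m'_2 - m'_1)/(n - 1)$ from the first two raw moments of a single form. The function name is an assumption, and the numerical inputs are taken from the 60-item example given earlier ($m'_1 = 0.833$, $m'_2 = g_{24}/n^2$).

```python
def f24_from_moments(n, m1, m2):
    """Single-form estimator of the cross-product moment m'_11.

    n  : number of items in the form
    m1 : first raw moment of the proportion-correct scores
    m2 : second raw moment of the proportion-correct scores
    """
    return (n * m2 - m1) / (n - 1)


n = 60
m1 = 0.833
m2 = 2517.822 / n ** 2   # m2' recovered from g24 = n^2 * m2'

estimate = f24_from_moments(n, m1, m2)
print(round(estimate, 8))
```

The result agrees with the value of $f_{24}$ obtained from $(g_{24} - g_{26})/n^{[2]}$, since the two expressions are algebraically identical.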

Formulas for estimating all raw bivariate moments and moment- 
products up through the fourth order from the univariate f’s are given in 
equations (1) through (20). 


(1) mir = fo. 
(2) MioMer = fos « 
(3) mb, & © [ln — Whoo + fale 
(4) miami, = 2 [a — Whe + fos 
(5) mbomin & + [ln — Whe + fos 
(6) mismi, = 2 (0 = Whe + hel: 


(7) ty & 75 [ln — 1)fu + 2m — Whao + fall 
(8) ms, = 75 fn!f, + Bn!" foo + fad 

(9 mit & A [n — fe + 2M — hs + ful. 

(10) abomts = = [ln — fa + 20 — has + fal 

(11) mommy = 5 (nf tn! fig + nl fis + Mf). 

(12) mbymby & 75 [(m — 1)"fo + ( — Dis + hes) + fa: 
(13) mmf & (nf, +!" fig + 2! fis + Mf). 

(14) ———mbgrmés & 7a (nf, + Bn fig + fos. 

(15) mhrmiomd, = [a — Ife + ln — fro + ful: 

(16) miMio = 7 (n' fo on fir + Qn" "fio + Mfrs). 


. 7 
(17) MioMioMi = 7 (n'*" fo + Qn" fio +n" fie + nfs). 


(18) boii = a(n — D"fp + — Din + @— Dfe + ful 
(19) mime = a [(m — D*fr + An — fs + fol 
(20) miss, S 5 (nf, + 3n!"fs + nf). 


The method of deriving these equations may be cursorily illustrated 
for the case of (4). Capital letters will be used to distinguish certain symbols 
relating to the second form of the test from the corresponding lower-case 
symbols relating to the first form of the test. The key principle in the derivation 
is that the expected value of a product of $\pi$’s is the same whether the 
subscripts are all lower case or partly lower case and partly capitals, provided 
that two subscripts referring to the same test are not allowed to assume the 
same value. 
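$$E_g m'_{10} m'_{11} = E_g\left(\frac{1}{n}\sum_{g=1}^{n}\pi_g\right)\left(\frac{1}{N}\sum_{a=1}^{N} z_a Z_a\right)$$

$$= E_g\left(\frac{1}{n}\sum_{g=1}^{n}\pi_g\right)\left[\frac{1}{N}\sum_{a=1}^{N}\left(\frac{1}{n}\sum_{k=1}^{n}x_{ka}\right)\left(\frac{1}{n}\sum_{L=1}^{n}X_{La}\right)\right]$$

$$= \frac{1}{n^3}\,E_g\sum_{g=1}^{n}\sum_{k=1}^{n}\sum_{L=1}^{n}\pi_g\pi_{kL}$$

$$= \frac{1}{n^3}\left[n^2(n - 1)\,E_g\pi_g\pi_{kL} + n^2\,E_g\pi_g\pi_{gL}\right] \qquad (g \neq k \text{ in the first term})$$

$$= \frac{1}{n}\left[(n - 1)\,E_g\pi_g\pi_{kL} + E_g\pi_g\pi_{gL}\right]$$

$$= \frac{1}{n}\left[(n - 1)\,E_g\frac{1}{n^{[3]}}\sum_{\neq}\pi_g\pi_{kl} + E_g\frac{1}{n^{[2]}}\sum_{\neq}\pi_g\pi_{gl}\right]$$

$$= E_g\,\frac{1}{n}\left[(n - 1)f_{18} + f_{22}\right].$$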


Equation (4) is a reasonable application of this last result. 

The estimates of the bivariate moments and moment-products can also, 
of course, be written out in terms of univariate g’s instead of f’s. Such formulas 
are given below, in order to provide a check on errors in the computations, 
and also on “typographical” errors in the formulas themselves. 








ee 
(1’) mii = [2] (Gos — G20). 
n 
, , td ca 1 
(2’) MioMoi = nil (G23 — ges). 
ak 
(3’) Mm, = n™n [(m — 1)g2o — (2n — 1) gos + Ng26]. 
ae 
(4’) MioMir = n™n [@ — 1)(gis — Ges) — NGo2 + ngos]. 
ae | 
(5’) MoM. = nn [(m — 1)(Gis — 2G22) — gos + Ngos]. 


Pp 





1 7 
(6’) mom = —— [(n — I giz — (2n — 1)gio + ngai). 
nn 
I> 


ae [(n — 1)°g6 — An(n — gro 
+ (6n*? — 3n — 1)gu — n(8n — 1) 926]. 


ie 
mn = — a [(n — 1) — 2)gs — 3m — 1)’ G20 
nn 


+ (8n? — 2n + 1)goun — n(n + 1)go0]. 


72 
m1 


I> 


7 5 [Km — 1)°(g. — 2gi8 + gas) 
nn 
Yi 2(n* om, 1)(— 9:3 + 2922) + (n? —n+ 2) gz 
Pr. n(3n 7 1) 925]. 


ns 1 
MioMin = nn? [(n — 1)°(g — 4913 + 2917) 


— 4(n — I)gis + 4(n? — 1)g2+ m+ 1) 923 
_— n(3n — 1) gos]. 


ao , 
MioM{, = Tas [(m — 1)(m — 2)g, — (mn — 1)°(2g18 + gis) 


+ (n = 1)(2917 + Jos) + (8n? — in + 4) goo 
~ n(n + 1) gos). 


[> 


, , 
M21M, = 





1 
nn? {m —- 1)*96 = 2n'?'(g1s + 918) 


+ n(5n — 8)go2 + (n’ — 1)g23 — n(3n — 1)gos). 








mimi, = —A— [in — I(n — ge — nyu, 
nen 
(13’) — (n — 1)(2n — 8)gis + 2n7 goo + (n — 1)? G08 
— n(n + 1)go8). 
MsoMo, = =a [nm — Im — 2)(g. — 3915) 
nn 


(14’) — 3m — I)gis + 38n(n — 1)go2 + (n + 1)g08 
— n(n + 1)go5]. 




















(15’) 


(16’) 


(17) 


(18) 


(19’) 


(20’) 


, , , 
M4Mi9Mo1 


, 72 
MiMi 


, , / 
MoM 0Mo1 


, 72 
M2o™Mo1 


y2,. 92 
Mio™Mo1 


13,0 
Mi0™Mo1 


FREDERIC M. LORD 337 


= ees [(m — 1)*(g2 — gu — 912) 

nn 
+ 2(n? — 1)(—gio + gis) + (W? — 2 + ore 
+ (n — 1)(8n + gis — n(Bn — 1)gai]. 


- roe [(m — 1)(m — 2)(g2 — giz) — 2(n — 1)’ gro 


+(r- 1I)(—g1r + 2914) + (n? —n+ 2) O16 
+ (n — 1)(2n — 1)gi9 — n(n + 1)gas]. 


~ 4 [(n — 1)(n — 2)(g2 — gir) 


+ 2(n aA 1)’(— gro + 916) + (n — 1)(—912 -+- 2914) 
+ (n’? + Igie — n(n + gail. 
= a [(n — 1)’(92 — 4910 + 2g) 
— 2(n — 1)(+ 91 + G12) + 2(n* — 1)gis 
+ (2n — 1)(n + 1)gio — n(Bn — 1)ga,]. 
= es [(n — 1)’g, — 4n'*' gs 
+ (2n? — 3n + 3)95 + 4(n” — lg — n(8n — 1)g]. 
= [im - Im — Do. — 3m — 1% 


4 2 
n'y 


+ 3(n — 1)gs + (8n” — 5n + 4)g, — n(n + 1)Q]. 


Estimates of the bivariate central moments are obtained by substituting 
estimated values from equations (1) through (20) for the monomials on the 
right of the following. 


(21) $m_{11} = m'_{11} - m'_{10}m'_{01}$. 

(22) $m_{21} = m'_{21} - 2m'_{11}m'_{10} - m'_{20}m'_{01} + 2m_{10}'^2 m'_{01}$. 

(23) $m_{31} = m'_{31} - 3m'_{21}m'_{10} + 3m'_{11}m_{10}'^2 + 3m'_{20}m'_{10}m'_{01} - m'_{30}m'_{01} - 3m_{10}'^3 m'_{01}$. 

(24) $m_{22} = m'_{22} - 4m'_{21}m'_{01} + 4m'_{11}m'_{10}m'_{01} + 2m'_{20}m_{01}'^2 - 3m_{10}'^2 m_{01}'^2$. 

(25) $m_{20}m_{11} = m'_{20}m'_{11} - m'_{20}m'_{10}m'_{01} - m'_{11}m_{10}'^2 + m_{10}'^3 m'_{01}$. 

(26) $m_{20}m_{02} = m'_{20}m'_{02} - 2m'_{20}m_{01}'^2 + m_{10}'^2 m_{01}'^2$. 

(27) $m_{11}^2 = m_{11}'^2 - 2m'_{11}m'_{10}m'_{01} + m_{10}'^2 m_{01}'^2$. 


Estimates of the bivariate cumulants, if desired, are similarly obtained 
([3], eq. 11.109; [1], p. 183) from 

(28) $\kappa_{11} = m_{11}$, 

(29) $\kappa_{21} = m_{21}$, 

(30) $\kappa_{31} = m_{31} - 3m_{20}m_{11}$, 

(31) $\kappa_{22} = m_{22} - m_{20}m_{02} - 2m_{11}^2$, 

terms of order 1/N being neglected. 
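The conversion from raw to central bivariate moments, and from these to cumulants, can be sketched as follows. This is an illustrative check rather than the paper’s computing routine: the function names and the small paired data set are assumptions. Equations (24) and (26) above rely on the symmetry between parallel forms (for example $m'_{21}m'_{01}$ standing also for $m'_{12}m'_{10}$), which does not hold exactly for arbitrary data, so the sketch uses the general unsymmetrized expansion for $m_{22}$.

```python
from fractions import Fraction


def raw(x, y, r, s):
    """Raw bivariate moment m'_{rs} = mean of x^r * y^s."""
    return Fraction(sum(a ** r * b ** s for a, b in zip(x, y)), len(x))


def central(x, y, r, s):
    """Central bivariate moment computed directly from deviations."""
    mx, my = raw(x, y, 1, 0), raw(x, y, 0, 1)
    return sum((a - mx) ** r * (b - my) ** s for a, b in zip(x, y)) / len(x)


def central_from_raw(x, y):
    """Central moments and cumulants built from raw moments only."""
    def m(r, s):
        return raw(x, y, r, s)

    a, b = m(1, 0), m(0, 1)
    m11 = m(1, 1) - a * b                                            # eq. (21)
    m21 = m(2, 1) - 2 * m(1, 1) * a - m(2, 0) * b + 2 * a ** 2 * b   # eq. (22)
    m31 = (m(3, 1) - 3 * m(2, 1) * a + 3 * m(1, 1) * a ** 2
           + 3 * m(2, 0) * a * b - m(3, 0) * b - 3 * a ** 3 * b)     # eq. (23)
    # General form of m22; it reduces to eq. (24) under form symmetry.
    m22 = (m(2, 2) - 2 * m(2, 1) * b - 2 * m(1, 2) * a + m(2, 0) * b ** 2
           + m(0, 2) * a ** 2 + 4 * m(1, 1) * a * b - 3 * a ** 2 * b ** 2)
    m20 = m(2, 0) - a ** 2
    m02 = m(0, 2) - b ** 2
    return {
        "m11": m11, "m21": m21, "m31": m31, "m22": m22,
        "k11": m11,                                   # eq. (28)
        "k21": m21,                                   # eq. (29)
        "k31": m31 - 3 * m20 * m11,                   # eq. (30)
        "k22": m22 - m20 * m02 - 2 * m11 ** 2,        # eq. (31)
    }


# Hypothetical number-correct scores on two forms for six examinees.
x = [3, 5, 2, 6, 4, 1]
y = [4, 5, 1, 5, 3, 2]
result = central_from_raw(x, y)
print(result["m11"] == central(x, y, 1, 1))
print(result["m31"] == central(x, y, 3, 1))
```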


Data and Procedure 


A 150-item vocabulary (synonyms) test had been administered to a 
nationwide sample of about 13,000 college and university seniors. A very 
few examinees who did not reach item 144 were excluded from the study, as 
were items 145-150. 

Twenty-four of the items were selected to form a “control test.’”’ This 
was used only to select two groups of 1,000 examinees each: 

Group H: examinees with control-test scores from 22 to 24, 

Group L: examinees with control-test scores from 0 to 8. 

The remaining 120 items will be denoted as Test R. These items were 
divided at random to form two 60-item “randomly parallel’ tests, denoted 
as Test P and Test Q. The answer sheets were then scored and the necessary 
item analysis data computed for each of these three ‘‘tests.’”’ The score dis- 
tributions for both tests and both groups are shown in Table 3. It is seen 
that the score distribution of Group H is quite negatively skewed and that 
of Group L, slightly positively skewed. 

The necessary moments, cumulants, and g’s were computed. For this 
purpose, proportion-correct scores were used—not number-correct scores. 
The moments and cumulants of Test R were estimated from the g’s of Test 
P, and the estimated and observed cumulants were then compared. Also, 
the bivariate cumulants between P and Q were estimated from the g’s of 
Test P, and the estimated and observed values compared. The above was 
repeated with Test Q substituted for Test P. 

Computational note. It was hoped that all quantities of order 1/n’ and 
of higher order in Tables 1 and 2 and in equations (1) to (20) would be 
negligible and could be ignored in the present computational work where 
n = 60 (or 120). This would avoid the necessity for computing fis , fie , and 
fiz , which require especially onerous computing. All computations were done 
TABLE 3 


Number-Correct Observed-Score Distributions 
for Group H and Group L 














Number- Group H Group L 
correct 
observed Test Test Test Test 
score P Q P Q 
59-60 5 12 
57-58 39 63 
55-50 12 135 
55-54 146 190 
51-52 188 187 
49-50 180 165 
47-48 126 105 a 
45-46 88 59 i 0 
45-44 60 46 1 2 
41-42 27 20 1 O 
39-40 13 E 0 3 
37-38 9 9 2 3 
35-36 4 2 4 6 
33-34 2 9 > 
31-32 1 5 32 
29-30 31 43 
27-28 4O 52 
25-26 70 88 
23-24 o7 83 
21-22 103 95 
19-20 129 105 
17-18 136 120 
15-16 115 96 
13-14 98 
11-12 64 66 
9-10 49 46 
7-8 30 33 
5-6 4 19 
3-4 b 5 
1-2 1 
Total 1000 1000 1000 1000 





using all the terms in the formulas, however, since there is no definite proof 
that it is safe to neglect the smaller terms even with n as large as 60. It is not 
surely known at present how large n must be before such terms may safely 
be ignored. 


Estimating the Cumulants of a Lengthened Test 


The first, second, and last columns of numbers in Tables 4 and 5 give 
the observed cumulants ($\kappa_r$) of Tests P, Q, and R, respectively. Actually, 
the rth root of $\kappa_r$ is given in order to avoid numbers with 4 or 5 zeros after 
the decimal point. [From certain points of view, it would be desirable to 
present relative cumulants ($\kappa_r/\kappa_2^{r/2}$) rather than $\kappa_r$. This is not done, however, 
since the comparison described in the next paragraph requires $\kappa_r$ itself.] 
The fourth and fifth columns of numbers give the estimates of the cumulants 





340 PSYCHOMETRIKA 


of Test R obtained via the equations in Tables 1 and 2. (The value $\sqrt{\kappa_2}$ = 
.068 in the fourth column of Table 4 is the square root of the quantity 
$m_2$ = .004568 obtained in the numerical example given earlier.) Each number 
in the third (sixth) column was obtained by first averaging the corresponding 
pair of observed (estimated) cumulants for Tests P and Q and then extract- 
ing the appropriate root. 

In most cases, there seems to be good agreement between the estimated 


TABLE 4 

Observed and Estimated Univariate Cumulants 
of the Proportion-Correct Scores for Group H 

                        Observed values            Estimated values          Observed
                        for 60-item tests          for 120-item test         value
                                                   Esti-    Esti-            for 120-
                        Test    Test    Aver-      mated    mated    Aver-   item
                        P       Q       age*       from P   from Q   age*    test
$\kappa_1$              .833    .850    .841       .833     .850     .841    .841
$\sqrt{\kappa_2}$       .074    .071    .073       .068     .064     .066    .066
$\sqrt[3]{\kappa_3}$   -.064   -.061   -.062      -.059    -.055    -.057   -.058
$\sqrt[4]{\kappa_4}$    .063    .056    .059       .058     .050     .055    .056

*Not an arithmetic average; see text. 
TABLE 5 
Observed and Estimated Univariate Cumulants 
of the Proportion-Correct Scores for Group L 
Observed values Estimated values 
for 00-item tests for 120-item test 
Observed 
Esti- Esti- value 
Test Test Aver- mated mated Aver- for 120- 
P Q age* from P from Q age* item test 
ant 314 - 320 317 314 -320 0517 317 
V*>5 -102 116 -109 095 109 -102 102 
VR .073 073 —-«.073 069 = .068.-—t—«=l OB .070 
4 
Ve, O77 033 -065 -079 -058 -O71 -073 





* 
See text. 
and actual values for Test R in Tables 4 and 5. The standard errors of the 
estimated values have not been worked out, so discrepancies cannot be 
compared with expected sampling fluctuations. However, some idea of the 
adequacy of the estimated values can be obtained for $\sqrt[3]{\kappa_3}$ and $\sqrt[4]{\kappa_4}$ by 
comparing the numbers in the fourth, fifth, and sixth columns with those in 


TABLE 6 

Observed and Estimated Bivariate Cumulants of 
Proportion-Correct Scores for Group H 

                        Observed uni-                                     Estimated bivariate
                        variate cumulants                                 cumulants                  Observed
                                                                          Esti-    Esti-             bivariate
                        Test    Test    Aver-                             mated    mated    Aver-    cumu-
                        P       Q       age*                              from P   from Q   age*     lants
$\sqrt{\kappa_2}$       .074    .071    .073     $\sqrt{\kappa_{11}}$     .060     .057     .059     .059
$\sqrt[3]{\kappa_3}$   -.064   -.061   -.062     $\sqrt[3]{\kappa_{21}}$ -.057    -.053    -.055    -.056
                                                 $\sqrt[4]{\kappa_{31}}$  .058     .050     .054     .056
$\sqrt[4]{\kappa_4}$    .063    .056    .059
                                                 $\sqrt[4]{\kappa_{22}}$  .057     .049     .053     .056

*Not an arithmetic average; see text. 
TABLE 7 

Observed and Estimated Bivariate Cumulants of 
Proportion-Correct Scores for Group L 

                        Observed uni-                                     Estimated bivariate
                        variate cumulants                                 cumulants                  Observed
                                                                          Esti-    Esti-             bivariate
                        Test    Test    Aver-                             mated    mated    Aver-    cumu-
                        P       Q       age*                              from P   from Q   age*     lants
$\sqrt{\kappa_2}$       .102    .116    .109     $\sqrt{\kappa_{11}}$     .087     .102     .095     .095
$\sqrt[3]{\kappa_3}$    .073    .073    .073     $\sqrt[3]{\kappa_{21}}$  .067     .066     .067     .069
                                                 $\sqrt[4]{\kappa_{31}}$  .079     .058     .071     .074
$\sqrt[4]{\kappa_4}$    .077    .033    .065
                                                 $\sqrt[4]{\kappa_{22}}$  .079     .062     .072     .075

*See text. 
the first, second, and third, respectively. The reason is that if the errors of 
measurement were normally distributed, independently of true score, then 
([5], pp. 2-3) the rth cumulants ($r \neq 2$) of observed scores on Tests P, Q, 
and R would all be equal to the same quantity (the rth cumulant of the true 
scores) and hence to each other. Examination of the tabled values shows that 
in each of the last two rows the numbers in the sixth column provide slightly 
better estimates than do those in the third column of each table. This result 
is in accord with the results of another study [7], which presents significance 
tests showing that the errors of measurement in these data are distributed 
neither normally nor independently of true score. 

The assumption of normally and independently distributed errors yields 
fairly good approximate estimates in the present case where the number of 
items in each test is 60. The disadvantage of these estimates in comparison 
to those in the fourth, fifth, and sixth columns will in general be greater 
whenever the number of items available for making the estimates is smaller. 


Estimating the Cumulants of the Scatterplot Between Parallel Forms 


The last column of Tables 6 and 7 gives the observed bivariate cumulants 
for Tests P and Q. The three columns immediately preceding show the 
corresponding estimates obtained from equations (1) to (31); there seems 
to be satisfactory agreement between these estimates and the observed 
values. 

If the errors of measurement were normally distributed, independently 
of true score, then any bivariate cumulant $\kappa_{rs}$ ($r + s \neq 2$) would be equal 
to the corresponding univariate cumulant $\kappa_{r+s}$. Examination of the average 
values in the last two rows of each table shows that the estimates obtained 
from equations (1) to (31) (sixth column) are as good as and in most cases 
slightly better than those obtained under the assumption of normal and 
independent errors of measurement (third column). 
A new paired comparison method, based upon choices between lotteries, 
is developed for the measurement of utilities of objects with respect to the 
utility of receiving nothing, i.e., the status quo. The method is used to esti- 
mate the utilities of four ees spec at geo These objects had also been studied 
in an earlier experiment which used choices between single objects and pairs 
of objects to determine a rational origin. A comparison of the results of the 
two experiments indicates that both methods scale objects with respect to 
the same rational origin and unit of measurement. 


From each of two distinct traditions has developed an interest in measur- 
ing the value or worth to an individual of an object or event. Psychological 
scaling theory, beginning with the work of Thurstone [8], phrases the problem 
in terms of measuring the subjective value of objects. From another approach, 
dominantly economic, has come an interest in measuring utility [5, 7]. It 
would appear that subjective value and utility may reasonably be given 
common definition. In each case, the purpose is to assign numbers to events 
to represent relative preference of an individual for those events [cf. 4]. 


The Rationale of Utility Measurement 


Contributing to the current interest in utility and decision theory was 
the publication in 1943 of von Neumann and Morgenstern’s Theory of Games 
and Economic Behavior [12]. There it was proposed that utility might be 
measured up to a linear transformation by offering an individual a series of 
choices between two options, (i) the certain receipt of object B, or (ii) a 
lottery resulting in the receipt of object A with probability p or the receipt 
of object C with probability (1 — p). For all such choices it is assumed that 
object A is preferred to object B, B is preferred to C, and A is preferred to C. 
The probability p is varied from choice to choice until that probability $p_0$ 
is found at which the individual is indifferent between (i) receiving B, and 
(ii) receiving A with probability $p_0$ or C with probability $(1 - p_0)$; i.e., 
$u(B) = p_0 u(A) + (1 - p_0)u(C)$, where u(X) represents the utility of object 
X. If numerical values are assigned to any two of the utilities the third is 
determined. For instance, if u(A) is arbitrarily assigned the value 1 and u(C) 
the value 0, then $u(B) = p_0$. The utility of each of a larger set of objects 
may be determined by adding to the study choices involving the additional 
objects (if the individual’s preferences satisfy certain consistency requirements 
[5]). 

If a series of choices between options with certain outcomes and options 
with risky outcomes is offered to an individual who, say, objects to gambling 
(i.e., for whom the utility of the act of gambling is negative) the estimates of 
the utilities of the outcomes would be biased since the individual would 
tend to prefer the certain options. This possibility can be eliminated by 
using only choices between symmetric lottery options, e.g., object A with 
probability p and C with probability (1 — p) or object B with probability 
(1 — p) and C with probability p. As before, p may be varied until that $p_0$ 
is found for which $p_0 u(A) + (1 - p_0)u(C) = p_0 u(C) + (1 - p_0)u(B)$. Again 
setting u(A) = 1 and u(C) = 0 it is now found that $u(B) = p_0/(1 - p_0)$. It 
should be noted that even if the utility of the act of gambling is a function 
of the probabilities in a lottery option it should not influence the individual’s 
choice between these lottery options because the same probabilities are used 
in both lotteries and, thus, the gambling utility appears as the same constant 
on both sides of the equation. However, if the utility of gambling is a function 
of both probability and utility, the utility estimates will be biased and 
inconsistent. In fact, the existence of such a utility of gambling would negate 
the usefulness of the von Neumann and Morgenstern concept of utility. 
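The two indifference relations discussed above can be written out as a short sketch. The function names are illustrative assumptions; each function simply solves the corresponding indifference equation for u(B), with u(A) and u(C) defaulting to the arbitrary values 1 and 0 used in the text.

```python
from fractions import Fraction


def u_b_certain_vs_lottery(p0, u_a=1, u_c=0):
    """Indifference between B for certain and the lottery (A with p0, C with 1 - p0):
    u(B) = p0*u(A) + (1 - p0)*u(C)."""
    return p0 * u_a + (1 - p0) * u_c


def u_b_symmetric_lotteries(p0, u_a=1, u_c=0):
    """Indifference between (A with p0, C with 1 - p0) and (B with 1 - p0, C with p0):
    p0*u(A) + (1 - p0)*u(C) = p0*u(C) + (1 - p0)*u(B), solved for u(B)."""
    return (p0 * u_a + (1 - 2 * p0) * u_c) / (1 - p0)


# With u(A) = 1 and u(C) = 0 these reduce to p0 and p0 / (1 - p0).
print(u_b_certain_vs_lottery(Fraction(1, 4)))
print(u_b_symmetric_lotteries(Fraction(1, 3)))
```

In the symmetric case an indifference probability below one half corresponds to u(B) below u(A), as expected when A is preferred to B.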

Suppose that instead of considering the utility of each of a set of objects 
for a given individual, we conceive of distributions of these utilities in a 
population of individuals. For the estimation of parameters of these distri- 
butions, it is neither necessary nor feasible to vary p independently and (in 
theory) continuously for each individual. Each individual in a sample may 
be offered the same set of choices at each of r levels of p, so that the proportion, 
P, of individuals preferring, say, the lottery yielding A with probability 
$p_k$ and C with $(1 - p_k)$ to the lottery yielding B with probability $(1 - p_k)$ 
and C with $p_k$ may be obtained. If the form of the function relating P and $p_k$ 
were known, the best fitting line could be determined and $p_0$ corresponding 
to P = 1/2 estimated by interpolation. The method of contingent paired 
comparisons described below is based upon this rationale, but the intermediate 
stage of solving for $p_0$ is bypassed in the estimation procedure. 

Notice that if p is set equal to 0 or 1, the choices involving lotteries as 
described above degenerate into choices between certain options, i.e., the 
psychophysical method of paired comparisons. Paired comparisons may be 
characterized as decision making under certainty; contingent paired com- 
parisons, with 0 < p < 1, as decision making under risk [5]. 

It should be noted that the described measurement procedures yield 
utilities which are unique only up to a linear transformation, i.e., numerical 
values must be assigned to two of the utilities before the others can be determined. 
This lack of uniqueness can be remedied by adopting two states 
as standards to be included in any experimental determination of utility. 
One of the states should define the rational origin or zero point of the utility 
scale; the other, the unit of measurement of the utility scale. If this were 
done, then, presumably, the numerical values obtained in different experi- 
ments would be directly comparable and predictions could be derived for a 
wider class of behavior. 

The status quo of an individual would appear to provide a reasonable 
rational origin. This state is, of course, analogous to receiving nothing as the 
outcome of a lottery, i.e., a zero increment in total wealth. Thus, in the 
method of contingent paired comparisons the utility of the status quo is set 
equal to zero. The validity of this rational origin is open to empirical veri- 
fication. The choice of the other state which serves to determine the unit of 
measurement is apparently quite arbitrary. However, in the method of 
contingent paired comparisons, as in the method of paired comparisons, the 
unit of measurement is set equal to the standard deviation of the utilities in 
the population of individuals. In order for this to be a meaningful unit of 
measurement and for the method of solution given below to hold, the variance 
of the distribution of utilities must be the same for every object. 


The Model 


The following assumptions are basic to the development of the aggregate 
model for choice between lotteries. 


(a) Between alternative options, the individual chooses that one which has 
the greatest expected utility. 


(b) The utility of the act of gambling does not enter asymmetrically into the 
decision between alternative options and is independent of the utilities 
of the outcomes. 


(c) The utility of the status quo is the same for all individuals. 


(d) Subjective probability of an outcome is equal to objective probability 
of the outcome and thus is known a priori. 


(e) The form of the joint sampling distribution of utilities over individuals 
is known. (A multivariate normal distribution function is assumed in the 
present study, with equal variance and zero covariance.) 


In this study there is no attempt at direct and separate empirical verification 
of these five assumptions, although (a), (b), (d), and (e) jointly must be valid 
at least to a first approximation in order for reasonable results to be obtained 
with the method. 

A parametric solution for contingent paired comparisons is given below. 
A special restriction is placed on the probabilities used in the lotteries so 
that the observed normal deviates may be summed across probability levels. 
The resulting paired comparison table can be solved by the usual method to 
obtain provisional scale values measured from an arbitrary origin. The 
independent observed normal deviates are then summed for each probability 
level. The distance of the arbitrary origin from the rational origin is esti- 
mated by the regression of these sums on a simple function of probability. 
This estimate of the distance is added to each of the provisional scale values 
to obtain absolute scale values, i.e., estimated average utilities measured 
from a rational origin. 

Let $U_i$ and $U_j$ represent the utility of outcomes i and j, respectively, 
for a particular subject. Then if given a choice between the two outcomes, 
the subject would prefer outcome i to outcome j if $U_i > U_j$, and would 
prefer outcome j to outcome i or be indifferent otherwise. 

Let the joint distribution of $U_i$, $U_j$ in the population of individuals 
be bivariate normal, i.e., $f(U_i, U_j) = N(\mu_i, \mu_j, \sigma_i^2, \sigma_j^2, \sigma_{ij})$. Then $U_{(i,j)} = 
U_i - U_j$ is normally distributed with mean $\mu_i - \mu_j$ and variance $\sigma_i^2 + 
\sigma_j^2 - 2\sigma_{ij}$. The proportion of individuals preferring object i to object j, 
$P_{(i,j)}$, corresponds to the area under this curve from 0 to $\infty$ and can be 
obtained from a table of normal deviates by finding the proportion associated 
with the normal deviate 

(1) $$\xi_{(i,j)} = \frac{\mu_i - \mu_j}{(\sigma_i^2 + \sigma_j^2 - 2\sigma_{ij})^{1/2}} .$$

This is consistent with the usual formulation of the paired comparisons 
scaling model [e.g., 11]. 

In the contingent paired comparisons situation employed in the present 
study, the alternative options are (i) receipt of object i with probability 
$p_k$ or “nothing” with probability $(1 - p_k)$; (ii) receipt of object j with probability 
$(1 - p_k)$ or “nothing” with probability $p_k$. If the utility of the status 
quo is equal to a constant value, $\gamma$, for all subjects, then the difference between 
the expected utilities for the alternative options in a contingent paired 
comparison item is normally distributed with mean 

(2) $$\mu_{(i,j)k} = p_k\mu_i + (1 - p_k)\gamma - p_k\gamma - (1 - p_k)\mu_j = p_k\mu_i - (1 - p_k)\mu_j + (1 - 2p_k)\gamma$$

and variance 

(3) $$\sigma^2_{U(i,j)k} = p_k^2\sigma_i^2 + (1 - p_k)^2\sigma_j^2 - 2p_k(1 - p_k)\sigma_{ij} .$$

The unit normal deviates associated, in the population, with mean utility 
differences for the two alternative options have the composition 

(4) $$\xi_{(i,j)k} = \frac{p_k\mu_i - (1 - p_k)\mu_j + (1 - 2p_k)\gamma}{\sigma_{U(i,j)k}} .$$
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Assuming as in Case V of Thurstone’s law of comparative judgment 
o; = o; = 1, and assuming o;; = 0, for all ¢, j, (3) becomes 
(5) Cay" ee De +(1- Dr) = Cy. 
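
Equations (2), (4), and (5) together give the model proportion for a single contingent item. The following sketch shows one way to compute it under the Case V assumptions; the function name `contingent_proportion` and the default $\gamma = 0$ are assumptions of the example.

```python
from math import sqrt
from statistics import NormalDist


def contingent_proportion(mu_i, mu_j, p_k, gamma=0.0):
    """Model proportion choosing (object i with prob. p_k) over
    (object j with prob. 1 - p_k), under Case V (unit variances, zero covariance)."""
    # Equation (5): standard deviation of the expected-utility difference.
    c_k = sqrt(p_k ** 2 + (1.0 - p_k) ** 2)
    # Equation (2): mean of the expected-utility difference.
    mean_diff = p_k * mu_i - (1.0 - p_k) * mu_j + (1.0 - 2.0 * p_k) * gamma
    # Equation (4): the normal deviate, converted to a proportion.
    return NormalDist().cdf(mean_diff / c_k)


if __name__ == "__main__":
    # Hypothetical utilities 2.84 and 1.35 at probability level .2.
    print(round(contingent_proportion(2.84, 1.35, 0.2), 3))
```

Reversing the roles of the two objects while replacing $p_k$ by $1 - p_k$ describes the same item viewed from the other option, so the two proportions sum to one.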


Substituting $c_k$ for $\sigma_{(i,j)k}$ in the denominator of (4), multiplying both sides
of (4) by $c_k$, and summing over all probability levels $k$, yields

(6)    $\sum_k c_k\zeta_{(i,j)k} = \mu_i\sum_k p_k - \mu_j\sum_k (1 - p_k) + \gamma\sum_k (1 - 2p_k)$.

We now introduce an important restriction upon the design of the
experimental study, namely that, for the $r$ values selected for $p_k$,

(7)    $\dfrac{1}{r}\sum_{k=1}^{r} p_k = \dfrac{1}{2}$.

Under this restriction,

(8)    $\sum_k p_k = \sum_k (1 - p_k) = r/2$

and

(9)    $\sum_k (1 - 2p_k) = 0$.

Substituting (8) and (9) in (6),

       $\sum_{k=1}^{r} c_k\zeta_{(i,j)k} = \dfrac{r}{2}(\mu_i - \mu_j)$,

or

(10)   $\mu_i - \mu_j = \dfrac{2}{r}\sum_{k=1}^{r} c_k\zeta_{(i,j)k}$.

Note that (10) implies a skew-symmetric matrix of summed (weighted)
normal deviates $\sum_k c_k\zeta_{(i,j)k}$, since (10) demands that

       $\sum_k c_k\zeta_{(i,j)k} = -\sum_k c_k\zeta_{(j,i)k}$

and that

       $\sum_k c_k\zeta_{(i,i)k} = 0$.

Except for the known constant $2/r$, the parametric form of the contingent
paired comparisons under restriction (7) is the same as that assumed in
ordinary paired comparisons. This fact will be used later (equation 10a) to
express the relative preference value for the $i$th object.
TABLE 1

Matrix Elements to be Summed in Equation (11)

         j = 1                                                j = 2                                                ...   j = n - 1
i = 2    $\mu_2 p_k - \mu_1(1-p_k) + \gamma(1-2p_k)$
i = 3    $\mu_3 p_k - \mu_1(1-p_k) + \gamma(1-2p_k)$    $\mu_3 p_k - \mu_2(1-p_k) + \gamma(1-2p_k)$
...
i = n    $\mu_n p_k - \mu_1(1-p_k) + \gamma(1-2p_k)$    $\mu_n p_k - \mu_2(1-p_k) + \gamma(1-2p_k)$    ...   $\mu_n p_k - \mu_{n-1}(1-p_k) + \gamma(1-2p_k)$

Now for each level of $p_k$ write only those elements of the matrix $c_k\zeta_{(i,j)k}$
below the diagonal, where $i > j$, as in Table 1. The sum of these elements,
for $i > j$, is

(11)   $c_k\sum_{i=2}^{n}\sum_{j=1}^{i-1}\zeta_{(i,j)k} = (n - 1)p_k\sum_{i=1}^{n}\mu_i - \sum_{j=1}^{n-1}(n - j)\mu_j + \dfrac{n(n - 1)}{2}(1 - 2p_k)\gamma$.

Each value, $\mu_i$, in (11) may be considered a sum of two components, one the
scale distance, $\mu_i'$, of the $i$th stimulus from an arbitrary provisional scale
origin, the other a constant, $\delta$, representing the distance of that arbitrary
origin from the rational origin. That is,

(12)   $\mu_i = \mu_i' + \delta$.

Let the provisional origin be given by the restriction

(13)   $\sum_{i=1}^{n}\mu_i' = 0$.

Substituting (12) and (13) in (11) and simplifying,

(14)   $c_k\sum_{i=2}^{n}\sum_{j=1}^{i-1}\zeta_{(i,j)k} = -\sum_{j=1}^{n-1}(n - j)\mu_j' + \dfrac{n(n - 1)}{2}(1 - 2p_k)(\gamma - \delta)$.

Replacing parameters by estimates, (14) becomes

(14a)  $c_k\sum_{i=2}^{n}\sum_{j=1}^{i-1} z_{(i,j)k} = -\sum_{j=1}^{n-1}(n - j)m_j' + \dfrac{n(n - 1)}{2}(1 - 2p_k)(\gamma - \delta)$,

where $m_j'$ estimates $\mu_j'$ and $z_{(i,j)k}$ is the normal deviate associated with
$P_{(i,j)k}$, the sample proportion of choice of the option which includes outcome
$i$ with probability $p_k$ over the option which includes outcome $j$ with
probability $(1 - p_k)$.
Equation (14a) is a linear expression relating the random variate
$c_k\sum\sum z_{(i,j)k}$ and the fixed variate $(1 - 2p_k)$ for $k = 1, 2, \cdots, r$. The quantity
$-\sum_{j=1}^{n-1}(n - j)m_j'$, which does not involve $k$, is a location constant of
no particular interest. On the other hand $(\gamma - \delta)$, which apart from the known
factor $n(n - 1)/2$ is the slope of the line represented by (14a), is of interest. It is
the amount which must be added to
transform relative utilities into absolute utilities, taken from the point which
represents the utility of the status quo. With fallible data the constant $(\gamma - \delta)$
may be estimated by plotting $c_k\sum\sum z_{(i,j)k}$ against $(1 - 2p_k)$ for all $k$,
determining graphically the slope of the line of best fit, and dividing that slope
by $n(n - 1)/2$. Alternatively, if we
wish to minimize the squared discrepancy between the observed $c_k\sum\sum z_{(i,j)k}$
and those predicted from (14a) the usual least squares solution yields

(15)   $\gamma - d = \dfrac{2\sum_k (1 - 2p_k)\, c_k\sum\sum z_{(i,j)k}}{n(n - 1)\sum_k (1 - 2p_k)^2}$.

Note that, due to (7), the mean of $(1 - 2p_k)$ is zero. Allowing $\gamma$ to be zero,
then $d$ may be found from (15), after multiplying both sides of the equation
by $-1$. Adding $d$ to each $m_i'$ yields an estimate of absolute utility, $m_i^0$ (see
equation 12). The $m_i'$ values are determined from the sample analogues of
(10) and (13):

(10a)  $m_i' - m_j' = \dfrac{2}{r}\sum_{k=1}^{r} c_k z_{(i,j)k}$,

but

(13a)  $\sum_{j=1}^{n} m_j' = 0$;

thus

(16)   $m_i' = \dfrac{2}{nr}\sum_{j=1}^{n}\sum_{k=1}^{r} c_k z_{(i,j)k}$,

and the estimated absolute value for object $i$ is

(12a)  $m_i^0 = m_i' + d$.
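
The parametric solution can be summarized computationally. The sketch below is one possible reading of equations (5), (15), (16), and (12a), not a definitive implementation; the function name `contingent_utilities`, the nested-list layout `z[k][i][j]`, and the explicit check of restriction (7) are assumptions introduced for illustration. The diagonal terms $z_{(i,i)k}$ are treated as zero, as the skew-symmetry following (10) requires.

```python
from math import sqrt


def contingent_utilities(z, p_levels, gamma=0.0):
    """Estimate absolute utilities from normal deviates.

    z[k][i][j] is the deviate for choosing the option offering outcome i with
    probability p_levels[k] over the option offering outcome j with
    probability 1 - p_levels[k]. Returns (absolute utilities, d).
    """
    r = len(p_levels)
    n = len(z[0])
    if abs(sum(p_levels) / r - 0.5) > 1e-9:
        raise ValueError("restriction (7) requires the p levels to average 1/2")
    # Equation (5): c_k for each probability level.
    c = [sqrt(p * p + (1.0 - p) ** 2) for p in p_levels]
    # Equation (16): relative values measured from the provisional origin.
    m_rel = [
        2.0 / (n * r) * sum(c[k] * z[k][i][j]
                            for j in range(n) if j != i
                            for k in range(r))
        for i in range(n)
    ]
    # Equation (15): least squares estimate of (gamma - d).
    x = [1.0 - 2.0 * p for p in p_levels]
    y = [c[k] * sum(z[k][i][j] for i in range(n) for j in range(i))
         for k in range(r)]
    gamma_minus_d = (2.0 * sum(xk * yk for xk, yk in zip(x, y))
                     / (n * (n - 1) * sum(xk * xk for xk in x)))
    d = gamma - gamma_minus_d
    # Equation (12a): absolute values measured from the rational origin.
    return [m + d for m in m_rel], d
```

When the deviates are generated exactly from known utilities under the Case V model with $\gamma = 0$, this procedure returns those utilities, which gives a simple check on the algebra.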


It should be noted that only the parametric solution for the contingent 
paired comparisons model has been discussed here. It is not difficult, how- 
ever, to derive the normal equations for least squares estimation of the 
utilities of the objects. One finds that even for a single level of p, the matrix 
of coefficients of these equations is nonsingular (unlike conventional paired 
comparisons). Thus in principle one may estimate the absolute utilities 
directly without the restriction $\sum_k p_k = r/2$. The much greater simplicity 
of the parametric solution recommends it for practical use, particularly since 
the restriction on p, is easy to meet. However, use of the restriction may be 
wasteful in some applications, particularly if average utilities of the objects 
differ extremely. 
The Experiment 


A questionnaire was administered to 146 male students in General 
Psychology classes at the University of North Carolina. Data for the 141 
subjects who responded to all items in the questionnaire form the basis for 
analysis. 

An outcome of each lottery in the questionnaire was the (pretended)
receipt of a gift. Each student was presented with a form containing line
drawings and catalogue descriptions of the four gifts: a record player, a pen
and pencil set, a brief case, and a desk lamp. (Subjects had previous experience
with the gifts since they had rated their preference for these and other
gifts on successive intervals schedules immediately prior to the administration
of the questionnaire described.) Subjects were instructed to “assume
that you do not already own any of the articles, and that they are gifts for
your own personal use, which you may not sell.”

FIGURE 1
Sample Item from Birthday Gift Questionnaire

Each of the 34 items consisted of two alternative lottery options, repre- 
sented by two rectangles, each containing five squares. In some squares 
appeared identical line drawings of a gift; the other squares were blank. 
Subjects were instructed to imagine that the five squares within each rectangle 
were tickets which would be placed in a box and thoroughly mixed. One ticket 
would be randomly selected and the subject would receive as a gift the article 
pictured on the ticket. If a blank ticket were drawn he would receive nothing. 
A sample item appears as Figure 1. 

The contingent paired comparisons items consisted of a lottery offering
gift $i$ with probability $p$ and nothing with probability $(1 - p)$ and a lottery
offering gift $j$ with probability $(1 - p)$ and nothing with probability $p$, where
$p$ takes on the values 1/5, 2/5, 3/5, and 4/5. The four probability levels and
six possible pairs of gifts yielded 24 options of this type (Table 2). These
contingent paired comparisons items are used to measure the average utilities
of the gifts with respect to a rational origin, the utility of the status quo.

The remaining 10 items provided internal checks on the scaling model.
Six of these were paired comparisons items, i.e., one lottery offered gift $i$ with
probability 1, the other lottery, gift $j$ with probability 1. Each of the other
four items contained all four gifts as outcomes, i.e., one lottery offered gift $h$
with probability $p$ and gift $i$ with probability $(1 - p)$, the other lottery,
gift $j$ with probability $(1 - p)$ and gift $k$ with probability $p$, with $p$ taking
on the values of 1/5, 2/5, 3/5, and 4/5.

The subjects indicated their choices by placing a check mark in a small 
box above the desired option (see Fig. 1). The proportion of subjects who 
chose the first lottery over the second lottery was determined for each item 
and transformed to a normal deviate. The average utility of each of the four 
gifts was estimated from the contingent paired comparisons data by using 
the parametric solution, equations (12a) and (16). The estimates are given 
in Table 3. 

To evaluate the consistency of the data, the estimated utilities were 
used to reconstruct the normal deviates and thus the proportions for the 
contingent paired comparisons items. Table 2 gives the observed proportion 
(Col. A), the reconstructed proportion (Col. B), and the absolute difference 
between the two (Col. C) for each of the 34 items in the questionnaire. The 
average absolute difference for the 24 contingent paired comparisons items 
is .056. 

A smaller average absolute difference, .051, is found for the six paired
comparisons items even though they were not used to estimate the utilities.
Probably this occurs because fewer assumptions are required for the paired
comparisons model or, alternatively, because a paired comparisons judgment
is easier to make than a contingent paired comparisons judgment. This
latter explanation is supported by the finding of a much larger average
absolute difference of .144 for the more complex items involving all four gifts.

TABLE 2

Comparison Between Observed Proportions and Reconstructed
and Predicted Proportions

                              A            B            C           D              E
Option I    Option II     Observed   Reconstructed    Error     Proportion       Error
                         proportion    proportion     |A-B|   predicted from    |A-D|
                                                                   [10]
.2A         .8B             .348         .266         .082        .172           .176
.4A         .6B             .830         .672         .158        .564           .266
.6A         .4B             .936         .946         .010        .920           .016
.8A         .2B             .964         .997         .033        .990           .026
.2A         .8C             .447         .504         .057        .532           .085
.4A         .6C             .894         .839         .055        .852           .042
.6A         .4C             .972         .976         .004        .971           .001
.8A         .2C             .986         .995         .009        .995           .009
.2A         .8D             .220         .131         .089        .294           .074
.4A         .6D             .688         .492         .196        .695           .007
.6A         .4D             .908         .907         .001        .949           .041
.8A         .2D             .972         .989         .017        .992           .020
.2B         .8C             .234         .364         .130        .432           .189
.4B         .6C             .518         .568         .050        .662           .144
.6B         .4C             .823         .770         .053        .854           .031
.8B         .2C             .908         .874         .034        .930           .022
.2B         .8D             .057         .069         .012        .208           .151
.4B         .6D             .220         .211         .009        .454           .234
.6B         .4D             .553         .534         .019        .759           .206
.8B         .2D             .844         .805         .039        .907           .063
.2C         .8D             .007         .050         .043        .142           .135
.4C         .6D             .078         .122         .044        .241           .163
.6C         .4D             .390         .324         .066        .427           .037
.8C         .2D             .730         .589         .141        .616           .114
(Average Error)                                       .056                       .094

A           B               .894         .854         .040        .788           .106
A           C               .950         .935         .015        .939           .011
A           D               .872         .754         .118        .863           .009
B           C               .752         .677         .075        .773           .021
B           D               .390         .357         .033        .617           .227
C           D               .227         .204         .023        .326           .099
(Average Error)                                       .051                       .079

.2A, .8C    .2B, .8D        .447         .292         .155        .403           .044
.4A, .6C    .4B, .6D        .624         .459         .165        .527           .097
.6A, .4C    .6B, .4D        .830         .660         .170        .660           .170
.8A, .2C    .8B, .2D        .879         .794         .085        .747           .132
(Average Error)                                       .144                       .111

TABLE 3

Utilities for Four Birthday Gifts

                              Contingent paired       Compound paired
Gift                            comparisons             comparisons
                              (Present study)       (Thurstone and Jones)
                                  $m_i^0$                 $M_i^0$
A  Record player                   2.84                    2.81
B  Pen and pencil set              1.35                    1.68
C  Brief case                       .70                     .62
D  Desk lamp                       1.87                    1.26


External Evidence for the Model 


Thurstone and Jones [10] report results from a compound paired com- 
parison determination of a rational origin based upon group preferences for 
each of five birthday gifts, four of which corresponded in every detail to those 
employed in the present experiment. Subjects in the Thurstone and Jones 
study were 194 male undergraduate students enrolled in the School of Business 
Administration of the University of North Carolina. Data were collected 
during 1952-53, five years earlier than the present study. The birthday gifts 
were scaled with respect to a rational origin by having the subjects respond 
to paired comparisons between single gifts and/or pairs of gifts, determining 
the provisional scale values of single gifts and of pairs of gifts, and then 
solving for the additive constant required to make the sum of the scale 
values of the single gifts equal to the scale value of the pair of gifts. 

Since four gifts are common to the two studies, direct comparison of
results from the alternative methods is possible. Estimates for the absolute
utilities of the four gifts are presented in Table 3 for both compound and
contingent paired comparisons. A plot of one set of scale values or utilities
against the other (Fig. 2) demonstrates that a unit-slope, zero-intercept line
adequately fits the data. It is noted, however, that a reversal in order occurs
between scale values for the pen and pencil set and the desk lamp, the former
being higher for the 1952-53 sample.

FIGURE 2
Contingent Paired Comparisons Scale Values, $m_i^0$, versus Compound Paired Comparisons
Scale Values, $M_i^0$

In 1952 the pen and pencil set was priced at $19.75. While by 1958 a 
comparable item was available at $22.00, it may safely be assumed that 
the ball-point pen market had completely altered consumer attitudes toward 
fountain pens. Nationally advertised, attractively designed pocket pens 
became very popular at prices between $1.98 and $5.00; such pens boxed 
with matched mechanical pencils were available at prices between $3.50 and 
$10.00. These factors would lead us to anticipate a smaller utility for the 
pen and pencil set in a 1958 population of consumers than in 1953. 

In the 1958 sample, not only had the scale value for the pen and pencil 
set fallen as expected, relative to 1953 results, but the scale value for the 
desk lamp had markedly increased. While the price of the fluorescent desk 
lamp had increased by ten percent, this alone seems insufficient to account 
for the observed effect. However, a parametric decrease in the utility of the 
pen and pencil set, over the interim period, could result in an apparent 
increase in utility for the item with most nearly equivalent popularity, the 
desk lamp; this effect upon estimated scale values occurs when only a small 
number of items are studied, as in the present case, when the parametrically 
altered distribution overlaps considerably the nearest distribution and over- 
laps only slightly the remaining, more extreme distributions, and when 
extreme proportions are less affected by the parametric change than would 
be predicted by the scaling model. This last condition would result if a small 
subgroup of subjects were inattentive, or if for other reasons their responses 
were inconsistent with the model. 

Having noted changed conditions sufficient to account for the observed 
discrepancies, the congruence between results from the two studies is con- 
sistent with the conclusion that the two sets of scale values have the same 
unit of measurement and rational origin. 

If the utility of receiving nothing is taken to be zero, then the Thurstone 
and Jones scale values can be used to predict the proportions of choice for 
the questionnaire used in the present experiment. These predicted proportions
and the absolute differences between observed and predicted proportions
are given in Cols. D and E, Table 2. For the 24 contingent paired comparisons
items the average error of .094 using the Thurstone and Jones scale 
values is, of course, larger than the average error of .056 using the fitted scale 
values. These two values might be contrasted with an average error of .305 
which is obtained when the utilities are assumed to be equal and a proportion 
of .500 is predicted for each item. Clearly, using the Thurstone and Jones 
scale values in the expected utility equations and assuming that the differ- 
ences are normally distributed, results in predictions which would, in at least 
some applications, be considered worthwhile. On the other hand, using the 
fitted scale values to reconstruct the proportions results in “predictions” 
which, while somewhat better than those obtained with the Thurstone and 
Jones scale values, are not good enough to justify the elimination of alternative 
assumptions (e.g., regarding the relation between subjective and objective 
probability) or even to justify the elimination of alternative scaling models 
(e.g., that proposed by Luce [4]). 


Conclusions and Implications 


The results of this experiment suggest that the method of contingent 
paired comparisons and the compound paired comparisons method used 
by Thurstone and Jones are relatively consistent techniques for measuring 
the subjective value or utility of objects on a scale with a common unit of 
measurement and rational origin. There is no assurance that complete in- 
variance would obtain over a wider class of experimental situations and 
measurement models which involve methods of solution other than paired 
comparisons. In particular, it would be anticipated that the origin would 
remain invariant, but that the unit of measurement might change, depending 
upon the sensitivity of the measurement method employed.* To obtain 
comparable scale values over a wide range of experimental situations, it is 
necessary to determine in advance the sensitivity factor characteristic of the 
situation, i.e., the value of the multiplicative constant necessary to achieve 
complete invariance of results. It would seem reasonable to demand extension 
of the measurement model to take account of this variable sensitivity of 
measurement. Also required is further empirical investigation to ascertain 
the sensitivity factors of methods such as paired comparisons, successive 
intervals, triads, etc. 

A more direct validation of these methods would be to select a new set 
of objects, present contingent and compound paired comparisons question- 
naires based on these objects to subjects, and compare the two sets of scale 
values. If both scaling models are valid then the differences in scale values 
should be of such magnitude that they may reasonably be attributed to 
sampling error. The comparison of the contingent paired comparisons scale 
values obtained in the present study and the compound paired comparisons 
scale values obtained by Thurstone and Jones is complicated since subjects 
from different populations were tested at different times in the two studies. 

A detailed validation of the contingent paired comparisons model must 
await empirical verification of von Neumann-Morgenstern utility theory for 
individual choice behavior. Encouraging results have been obtained in this 
area by Mosteller and Nogee [6] and by Davidson, Suppes, and Siegel [1], 
but much more remains to be done. For example, the contingent paired
comparisons questionnaire could be used to obtain a utility function for
each individual. A literal interpretation of the contingent paired comparisons
scaling model implies that the average of these individual utilities should be
equal to the average utility yielded by the estimation procedure given above.
This possibility should be investigated, both theoretically and experimentally.

*When the utilities of the “birthday gifts” are assessed by successive intervals [3],
the solution is consistent with the origin determined by the two paired comparisons
methods, but a multiplicative constant must be used to achieve the identity relation.

The aims and purposes of individual utility measurement and of group 
utility measurement should be distinguished. Previous experimental de- 
terminations of utility [1, 6] have attempted to measure a utility function 
for each individual. This is in contrast to the aim of contingent paired com- 
parisons, the determination of distributions of utility over individuals. The 
appropriateness of the method depends upon the contemplated application 
of the measures. If one is interested in predicting the decision-making behavior 
of one individual then, quite obviously, the relevant utility function must be 
known for that individual. On the other hand, if one intends to use an aggre- 
gate model to make probabilistic predictions about the decision-making 
behavior of a large group or population of individuals, the parameters of the 
distributions of utilities must be determined. Such aggregate models are to 
be found in economic theory and in studies of consumer behavior and voting 
behavior [2, 9]. These distributions of utilities could be determined by measur- 
ing individual utilities. However, the determination of an individual utility 
function appears to be a time-consuming process. The method of contingent 
paired comparisons sacrifices detailed knowledge of individual utilities in 
order to reduce the time and effort required of each individual so that the 
technique is feasible for use in sample surveys. 
A generalization of the customary model underlying the Law of Com- 
parative Judgment is established. Methods for estimating location parameters 
and testing a certain hypothesis are discussed. The problem of the influence 
of the shape of the curve used to grade responses is raised and a possible 
approach to its solution is indicated. Two examples are provided. 


In his presidential address to the Psychometric Society, Mosteller [6] 
called attention to the need of exploring the sensitivity of the method of 
paired comparisons to the shape of the curve used to grade responses. This 
paper sets up and discusses a model which seems an appropriate basis for 
such an investigation. A method for comparing results based on two different 
response curves is suggested. Two examples are provided illustrating the 
consequences of using different response curves. 


The Model 
For the purpose of this paper, the method of paired comparisons can be
described as follows. There are $s$ treatments (items, stimuli) $t_1, \cdots, t_s$ which
are to be compared in pairs. When $t_i$ is compared with $t_j$, there is a probability
$\pi_{ij}$ that $t_i$ will be preferred to $t_j$, in symbols $\pi_{ij} = P(t_i > t_j)$. Assume
that $\pi_{ji} = 1 - \pi_{ij}$. The $\pi_{ij}$ depend on $s$ parameters $\alpha_k$, one for each treatment
$t_k$. In order to establish the relationship between the $\alpha_k$ and the $\pi_{ij}$,
we postulate the existence of chance variables

(1)    $X_{ij} = \alpha_i - \alpha_j + \epsilon_{ij}$,

which may be interpreted as the amount of preference of $t_i$ over $t_j$. In (1),
the $\epsilon_{ij}$ are chance variables having density $f(u)$ symmetric about $u = 0$.
Equation (1) is satisfied if it is assumed, as is often done, that $t_k$ produces
sensations

       $X_k = \alpha_k + \epsilon_k$,

where the $\epsilon_k$ are identically distributed, and $X_{ij} = X_i - X_j$. Because of
the assumed symmetry of $f(u)$,

(2)    $\pi_{ij} = P(t_i > t_j) = P(X_{ij} > 0) = 1 - F(\alpha_j - \alpha_i) = F(\alpha_i - \alpha_j)$,

where

       $F(u) = \int_{-\infty}^{u} f(v)\,dv$.

*The author is obliged to F. Mosteller for helpful discussions.


Note that the $\pi_{ij}$ are not changed if the $\alpha_k$ are replaced by $c\alpha_k + b$,
where $b$ and $c > 0$ are constants, and, at the same time, $F(u)$ is replaced
by $F_c(u) = F(u/c)$. It seems reasonable to say that two representations
(2) which produce the same set of probabilities $\pi_{ij}$ are equivalent. In particular,
two sets of $\alpha_k$ which differ only by a linear transformation are equivalent.

The main problem now may be formulated as follows. Given observed
preference statements on the treatments $t_1, \cdots, t_s$, find estimates $a_k$ of
the $\alpha_k$, except for a linear transformation. Since the $\alpha_k$ enter model (1) only
as differences, with no loss of generality it may be assumed that

(3)    $\sum_{k=1}^{s}\alpha_k = 0$.

With this assumption, the equivalence relationship implies that changes in
scale do not affect our procedures.

The Estimates 


Let $n$ denote the number of times each treatment is compared with
any other treatment, and $n_{ij}$ the number of times treatment $t_i$ is preferred
to treatment $t_j$. By (2), $n_{ij}$ is a binomial variable with parameters $n$ and

(4)    $\pi_{ij} = F(\delta_{ij})$,

where

(5)    $\delta_{ij} = \alpha_i - \alpha_j$.

From (3) and (5),

(6)    $\alpha_k = \dfrac{1}{s}\sum_{m=1}^{s}\delta_{km}$.

This suggests the following method of finding estimates $a_k$ of the $\alpha_k$. Set
$p_{ij} = n_{ij}/n$, so that the $p_{ij}$ are the maximum likelihood estimates of the
$\pi_{ij}$. As in (4), define $d_{ij}$ by

(7)    $F(d_{ij}) = p_{ij}$, $i \neq j$;    $d_{ii} = 0$.

(If $p_{ij} = 0$ or 1 and the range of $\epsilon_{ij}$ is infinite, (7) does not give a useful value
for $d_{ij}$. One possibility is to replace the observed value of $p_{ij}$ by $1/(2n)$ or
$1 - 1/(2n)$. If the range of $\epsilon_{ij}$ is finite, and $p_{ij} = 0$ or 1, set $d_{ij}$ equal to the
lower or upper bound for $\epsilon_{ij}$.)

In general, it is impossible to satisfy relations $a_i - a_j = d_{ij}$, corresponding
to (5), since there are more equations than unknowns. However,
these relations can be satisfied “on the average” in the sense of (6) by setting

(8)    $a_k = \dfrac{1}{s}\sum_{m=1}^{s} d_{km}$.

Following Mosteller [5], it can be shown that the estimates (8) have the
interesting property of minimizing

(9)    $\sum_{i<j} (a_i - a_j - d_{ij})^2$,

i.e., they are least square estimates. Since $d_{ij} = -d_{ji}$ in view of the symmetry
of $f(u)$,

       $\sum_{k=1}^{s} a_k = 0$,

in agreement with (3). If $F(u)$ in (7) is replaced by $F_c(u) = F(u/c)$, the $a_k$
are replaced by the equivalent ${}_c a_k = ca_k$. If $f(u)$ is a normal density, the above
model reduces to the one underlying the Law of Comparative Judgment or
Thurstone-Mosteller model, as it is often called in statistical literature.
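
The estimates (8) are easy to compute for any response curve whose inverse is available. The sketch below is illustrative only; the function name `location_estimates`, the square-list layout of the counts, and the choice to apply the $1/(2n)$ replacement to every zero or unit proportion regardless of the range of the response curve are assumptions of the example.

```python
from statistics import NormalDist


def location_estimates(counts, n, inv_F):
    """Estimates a_k of equation (8).

    counts[i][j] is the number of times treatment i was preferred to
    treatment j in n comparisons; inv_F is the inverse of the assumed F(u).
    """
    s = len(counts)

    def d(i, j):
        if i == j:
            return 0.0                      # d_ii = 0 by (7)
        p = counts[i][j] / n                # observed proportion p_ij
        if p <= 0.0:
            p = 1.0 / (2 * n)               # replacement suggested for p_ij = 0
        elif p >= 1.0:
            p = 1.0 - 1.0 / (2 * n)         # replacement suggested for p_ij = 1
        return inv_F(p)                     # solves F(d_ij) = p_ij

    return [sum(d(k, m) for m in range(s)) / s for k in range(s)]


if __name__ == "__main__":
    counts = [[0, 7, 8], [3, 0, 6], [2, 4, 0]]   # hypothetical data, n = 10
    print(location_estimates(counts, 10, NormalDist().inv_cdf))
    print(location_estimates(counts, 10, lambda p: p - 0.5))
```

Only the inverse response curve changes between the two calls; the averaging step is identical, which is the computational advantage claimed for (8).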

Clearly, estimates of the $\alpha_k$ other than those given by (8) are possible.
However, the estimates (8) have the advantage that the method of computation
remains the same whatever the assumed response curve $F(u)$. This
would not in general be the case for estimates based on methods of maximum
likelihood, minimum chi square, etc. More important, any advantage in
efficiency which such estimates might have, would most likely be offset by
the uncertain character of the true function $F(u)$ appropriate in a particular
case.


The Uniform Distribution 


For both theoretical and practical reasons, the uniform distribution
deserves special discussion. If $f(u) = 1$, $-\frac{1}{2} < u < \frac{1}{2}$, then $F(u) = u + \frac{1}{2}$
and $d_{ij} = p_{ij} - \frac{1}{2}$. It follows that

(10)   $a_k = \dfrac{1}{sn}\left[n_k - \tfrac{1}{2}n(s - 1)\right]$,    $\left(n_k = \sum_m n_{km}\right)$.

$n_k$ is the number of times treatment $t_k$ has been preferred in the $n(s - 1)$
comparisons in which $t_k$ is involved.

In this case the estimates $a_k$ do not depend on the individual $n_{km}$, but
only on the $n_k$. Further, according to (10), $a_k$ is simply a linear function of
$n_k$; hence $n_k$ is equivalent to $a_k$. From a computational point of view, no
simpler estimates than the $n_k$ can be imagined.

To the extent that all functions $F(u)$ are approximately linear in the
neighborhood of $u = 0$, it is to be expected that for sets of $\alpha_k$ which do not
differ much among themselves, estimates will be approximately equivalent
to those for the uniform distribution, independent of which function $F(u)$
is actually used in grading the responses. This statement can be made more
precise as follows.

If $n$ is not too small, approximately,

       $p_{ij} - \pi_{ij} = F(d_{ij}) - F(\delta_{ij}) = (d_{ij} - \delta_{ij}) f(\delta_{ij})$.

It follows that if the $\alpha_k$ do not differ much among themselves, a good approximation
is usually given by

       $d_{ij} = \dfrac{p_{ij} - \frac{1}{2}}{f(0)}$

or

(11)   $a_k = \dfrac{1}{snf(0)}\left[n_k - \tfrac{1}{2}n(s - 1)\right]$,

which is simply a multiple of (10).
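
A small numerical comparison may make the approximation (11) concrete. The sketch assumes a unit normal response curve, for which $f(0) = 1/\sqrt{2\pi}$; the function names and the hypothetical counts (proportions close to one half) are assumptions of the example.

```python
from math import pi, sqrt
from statistics import NormalDist


def normal_estimates(counts, n):
    """Estimates (8) with F taken as the unit normal distribution function."""
    s = len(counts)
    inv = NormalDist().inv_cdf
    return [sum(inv(counts[k][m] / n) for m in range(s) if m != k) / s
            for k in range(s)]


def linear_approximation(counts, n, f0):
    """Approximation (11): a multiple 1/f(0) of the uniform estimates (10)."""
    s = len(counts)
    return [(sum(counts[k]) - 0.5 * n * (s - 1)) / (s * n * f0)
            for k in range(s)]


if __name__ == "__main__":
    # Hypothetical counts out of n = 100 with proportions near one half.
    counts = [[0, 55, 60], [45, 0, 52], [40, 48, 0]]
    f0 = 1.0 / sqrt(2.0 * pi)   # unit normal density at zero
    print(normal_estimates(counts, 100))
    print(linear_approximation(counts, 100, f0))
```

For these data the two sets of estimates differ only in the third decimal place; the agreement deteriorates as the observed proportions move away from one half.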

Use of the uniform distribution may seem objectionable from a practical 
point of view. It can be shown (private communication from John Pratt) that 
the uniform distribution cannot correspond to the difference of two inde- 
pendently and identically distributed variables. However, it should be 
emphasized that model (1) does not require that $X_{ij}$ be the difference of two
variables $X_i$ and $X_j$. A more serious objection to the use of the uniform
distribution is that in general it does not seem to provide a very good fit,
as will be seen in the examples in the last section. Perhaps, one might regard
the estimates (10) as some kind of average estimates, corresponding to all
possible combinations of the $n_{km}$ leading to the observed $n_k$.


Comparison with Bradley-Terry Model 





Let

       $F(u) = \dfrac{1}{1 + e^{-u}}$.

We easily find

(12)   $\pi_{ij} = \dfrac{e^{\alpha_i}}{e^{\alpha_i} + e^{\alpha_j}} = \dfrac{\pi_i}{\pi_i + \pi_j}$,

where

(13)   $\pi_k = e^{\alpha_k}$,    $k = 1, \cdots, s$.

This is the Bradley-Terry model [2] which assumes that there exist parameters
$\pi_k$ such that $\pi_{ij}$ is given by (12). Bradley and Terry are primarily
interested in estimating the unknown parameters $\pi_k$ and in testing hypotheses
involving the $\pi_k$ by maximum likelihood methods. If all $\pi_k$ are multiplied by
the same positive constant $d$, the $\pi_{ij}$ in (12) remain unchanged, and the two
sets of $\pi_k$-values, which differ only by a multiplicative constant, are equivalent
as far as the Bradley-Terry model is concerned.

Equation (13) can be used to determine location parameters $\alpha_k$ in
terms of the $\pi_k$,

(14)   $\alpha_k = \ln \pi_k$.

If the $\pi_k$ are replaced by the equivalent ${}_d\pi_k = d\pi_k$, (14) gives

       ${}_d\alpha_k = \ln d\pi_k = \alpha_k + \ln d$,

and the resulting ${}_d\alpha_k$ are equivalent to the original $\alpha_k$ as far as our model is
concerned.

The reverse, however, is not true. If the $\alpha_k$ of our model are replaced
by the equivalent ${}_c\alpha_k = c\alpha_k$, (13) gives

       ${}_c\pi_k = e^{c\alpha_k} = (\pi_k)^c$.

These ${}_c\pi_k$ are not equivalent to the original $\pi_k$ in the Bradley-Terry model.
Thus, while the estimates $p_k$ of the Bradley-Terry model can be used to
provide estimates of the location parameters $\alpha_k$, in general, our estimates
$a_k$ cannot be used to find estimates of the $\pi_k$.
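
The asymmetry between the two kinds of equivalence can be checked numerically. The following sketch uses hypothetical location parameters; the helper names `logistic` and `bradley_terry` are assumptions of the example.

```python
from math import exp


def logistic(u):
    """Response curve F(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + exp(-u))


def bradley_terry(pi_i, pi_j):
    """Preference probability of equation (12) in terms of the pi_k."""
    return pi_i / (pi_i + pi_j)


if __name__ == "__main__":
    alpha = [0.5, 0.0, -0.5]                 # hypothetical location parameters
    pis = [exp(a) for a in alpha]            # equation (13)
    # Logistic F reproduces the Bradley-Terry probability.
    print(logistic(alpha[0] - alpha[1]), bradley_terry(pis[0], pis[1]))
    # Multiplying every pi_k by d = 3 leaves the probability unchanged.
    print(bradley_terry(3 * pis[0], 3 * pis[1]))
    # Replacing alpha_k by 2 * alpha_k (pi_k squared) changes it.
    print(bradley_terry(pis[0] ** 2, pis[1] ** 2))
```

The first three printed values agree, while the last differs, illustrating that a change of scale in the $\alpha_k$ is not an equivalence within the Bradley-Terry parametrization.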


A Test of the Hypothesis $\alpha_1 = \cdots = \alpha_s$


Before using a given set of estimates $a_k$, it will be useful to establish the
fact that the corresponding $\alpha_k$ differ from each other. This can be done by
testing (and rejecting) the hypothesis $\alpha_1 = \cdots = \alpha_s$ by means of the statistic

(15)   $A_f = 4nsf^2(0)\sum_{k=1}^{s} a_k^2$,

where the subscript $f$ indicates that the $a_k$ have been computed by (8) using
the function $f(u)$. Under the null hypothesis, $A_f$ has an asymptotic chi square
distribution with $s - 1$ d.f. as $n \to \infty$. This result can be proved directly
starting from the asymptotic normality of the $a_k$. However, the following
proof is more instructive.

If the null hypothesis is true, (11) holds asymptotically, and therefore,

(16)   $A_f \sim \dfrac{4}{ns}\sum_{k=1}^{s}\left[n_k - \tfrac{1}{2}n(s - 1)\right]^2$.

The right side of (16) is the $\chi^2$-statistic of the method of s rankings for
balanced incomplete blocks of size 2 as given by Durbin [3]. The asymptotic
chi square distribution of $\chi^2$ has been established rigorously by Benard and
van Elteren [1]. It follows from (10) and (15) that the $\chi^2$-test is not only the
limiting form of the $A_f$-test, but also a special case, namely, for the uniform
distribution. It is interesting to note that the corresponding test developed
by Bradley and Terry also reduces asymptotically to the $\chi^2$-test [4].

Equation (16) throws some light on the problem under consideration.
If we are only interested in testing the null hypothesis $\alpha_1 = \cdots = \alpha_s$,
asymptotically it does not make any difference which density $f(u)$ is chosen
in computing (15). More exactly, it can be shown that if $A_1$ and $A_2$ are two
tests (15) corresponding to different densities $f_1(u)$ and $f_2(u)$, the asymptotic
efficiency of either test relative to the other is unity.
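
The statement that the $\chi^2$-statistic on the right of (16) is the special case of (15) for the uniform distribution can be verified on any set of counts. The sketch below is illustrative; the function names and the hypothetical data are assumptions of the example.

```python
def uniform_estimates(counts, n):
    """Estimates (10): a_k for the uniform response curve."""
    s = len(counts)
    return [(sum(row) - 0.5 * n * (s - 1)) / (s * n) for row in counts]


def a_statistic(a, n, f0):
    """Statistic (15): 4 n s f(0)^2 times the sum of squared estimates."""
    return 4.0 * n * len(a) * f0 ** 2 * sum(x * x for x in a)


def chi_square_statistic(counts, n):
    """Right side of (16), computed from the preference totals n_k."""
    s = len(counts)
    return 4.0 / (n * s) * sum((sum(row) - 0.5 * n * (s - 1)) ** 2
                               for row in counts)


if __name__ == "__main__":
    counts = [[0, 7, 8], [3, 0, 6], [2, 4, 0]]   # hypothetical data, n = 10
    a = uniform_estimates(counts, 10)
    # For the uniform density f(0) = 1, so (15) and (16) coincide exactly.
    print(a_statistic(a, 10, 1.0), chi_square_statistic(counts, 10))
```

The resulting value would be referred to a chi square distribution with $s - 1$ degrees of freedom, here 2.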


The Choice of F(u) 


In practice, it is generally unknown which function F(u) comes closest
to the “true” situation. It is therefore of interest to compare the results of
using different functions for estimating purposes. This raises the question of
how two sets of estimates $a_i$ and $a_i^*$ based on two functions F(u) and F*(u)
can be compared. It is certainly reasonable to require that the results of
such a comparison should be unaffected if a particular set of estimates is
replaced by an equivalent set of estimates. This requirement is satisfied if
the comparison is based on expressions of the type F(L), where L is a
homogeneous linear function of the estimates $a_i$. Indeed, if $L_c = cL$, then


$F_c(L_c) = F_c(cL) = F(L)$.


Mosteller [6] suggests a comparison based on the differences $|\hat{p}_{ij} - p_{ij}|$,
where $\hat{p}_{ij} = F(a_i - a_j)$ is the recaptured proportion of preferences based on
the estimates $a_i$ and $a_j$. Obviously this quantity has the indicated form.

In making a comparison of this type, it should be remembered, however,
that it is influenced not only by the functions under consideration but
also by sampling effects. Thus in carrying out a paired comparison procedure,
one obtains observations on s(s − 1)/2 binomial variables $n_{ij}$. If s is not too
small, it is to be expected that one or possibly several of the $p_{ij}$ will deviate
considerably from their theoretical values $\pi_{ij}$ by pure chance. As a
consequence, a comparison based on $|\hat{p}_{ij} - p_{ij}|$ may conceivably indicate a
worse fit for the true function F(u) than for some other function F*(u).
Theoretically, this difficulty can be avoided by working with the population
parameters $\pi_{ij}$ instead of the observations $p_{ij}$. Practically, this amounts
to the assumption that n is very large. This suggests that we study the
amount of distortion in the $\pi_{ij}$ which is due to the use of an incorrect function
F*(u) instead of the correct function F(u).

Given true probabilities $\pi_{ij} = F(\alpha_i - \alpha_j)$, define $\delta_{ij}^*$ by $F^*(\delta_{ij}^*) = \pi_{ij}$
and set


$\alpha_i^* = \sum_{m=1}^{s} \delta_{im}^* / s$


in accordance with our earlier estimation procedure. Then


$\pi_{ij}^* = F^*(\alpha_i^* - \alpha_j^*)$,


and the amount of distortion may be measured in terms of the


$\tau_{ij} = \pi_{ij}^* - \pi_{ij}$.


If an over-all measure of distortion is desired, several possibilities exist.
We might simply use


$\tau_1 = \max_{i,j} |\tau_{ij}|$.


However, two other measures of over-all distortion come to mind,


$\tau_2 = \frac{1}{N} \sum_{i>j} |\tau_{ij}|$


and


$\tau_3 = \left( \frac{1}{N} \sum_{i>j} \tau_{ij}^2 \right)^{1/2}$.


Because of the least square character of the $\alpha_i^*$, $\tau_3$ seems most appropriate.
For the uniform distribution, it is possible to give a simple formula
expressing the recaptured $\pi_{ij}^*$ in terms of the original $\pi_{ij}$,


(17) $\pi_{ij}^* = \pi_{i\cdot} - \pi_{j\cdot} + \tfrac{1}{2}$, $\left( \pi_{i\cdot} = \sum_m \pi_{im}/s \right)$.


If $\pi_{ij}^*$ as given by (17) is greater than 1 (smaller than 0), it has to be replaced
by 1 (0).
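
As a rough check on (17) and on the three over-all measures, the sketch below applies the uniform formula to the true proportions listed in the “True” column of Table 1. It assumes that $\pi_{ii} = 1/2$ enters the row averages, that the sums run over the s(s − 1)/2 pairs with i > j, and that N is that number of pairs. The function names are hypothetical.

```python
def full_matrix(lower, s):
    """Build the s x s matrix of pi_ij from the entries with i > j (0-based)."""
    pi = [[0.5] * s for _ in range(s)]      # pi_ii = 1/2 is an assumption
    for (i, j), p in lower.items():
        pi[i][j] = p
        pi[j][i] = 1.0 - p
    return pi

def uniform_recapture(pi):
    """Formula (17): pi*_ij = pi_i. - pi_j. + 1/2, truncated to [0, 1]."""
    s = len(pi)
    row_mean = [sum(row) / s for row in pi]
    return [[min(1.0, max(0.0, row_mean[i] - row_mean[j] + 0.5))
             for j in range(s)] for i in range(s)]

def distortion_measures(pi, pi_star):
    """Return (tau_1, tau_2, tau_3) over the pairs i > j."""
    s = len(pi)
    tau = [pi_star[i][j] - pi[i][j] for i in range(s) for j in range(i)]
    n_pairs = len(tau)
    tau1 = max(abs(t) for t in tau)
    tau2 = sum(abs(t) for t in tau) / n_pairs
    tau3 = (sum(t * t for t in tau) / n_pairs) ** 0.5
    return tau1, tau2, tau3

# True proportions of Table 1 (triangular distribution), 0-based indices.
table1_true = {(1, 0): 0.820, (2, 0): 0.875, (3, 0): 0.955,
               (2, 1): 0.595, (3, 1): 0.755, (3, 2): 0.680}

pi = full_matrix(table1_true, 4)
pi_star = uniform_recapture(pi)
tau1, tau2, tau3 = distortion_measures(pi, pi_star)
print(round(pi_star[2][1], 3), round(tau1, 3), round(tau2, 3), round(tau3, 3))
```

With these inputs the untruncated value for the pair (4, 1) exceeds 1, so the truncation rule stated after (17) comes into play.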

In the particular case when $F^*(u) = F_c(u) = F(u/c)$,


$\delta_{ij}^* = c(\alpha_i - \alpha_j)$


and


$\alpha_i^* - \alpha_j^* = c(\alpha_i - \alpha_j)$.


It follows that

$\pi_{ij}^* = F_c[c(\alpha_i - \alpha_j)] = \pi_{ij}$,


and $\tau_{ij} = 0$. Our method of estimation is reflexive in the sense that it allows
exact recovery of the original $\pi_{ij}$ if a distribution of the same type (differing
only by a scale parameter) is used. (If some $\pi_{ij}$ = 0 or 1, as may happen for
a distribution with a finite range, this statement is in general not true.)

As a rule, the exact values of the $\tau_{ij}$ depend on the functions F(u) and
F*(u) as well as the $\alpha_i$. However, certain statements of a somewhat general
nature are possible. If the $\pi_{ij}$ do not differ much from 1/2, it follows from our
earlier results that the nature of F*(u) does not have much influence on the
$\alpha_i^*$. In this case, all $\tau_{ij}$ will be quite small in absolute value. If the $\pi_{ij}$ cover
the range from 0 to 1, the behavior of the $\tau_{ij}$ is influenced a good deal by the
tails of f(u) and f*(u). In general, if f*(u) has higher tails than f(u), the
$\pi_{ij}^*$ will not be as extreme as the $\pi_{ij}$ which are close to 0 or 1, while the reverse
is true for the $\pi_{ij}^*$ corresponding to $\pi_{ij}$ closer to 1/2. If f*(u) has lower tails
than f(u), the situation is in general reversed.

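The following analysis throws some light on the $\tau_{ij}$. Setting


$\bar{\delta}_{ij}^* = \alpha_i^* - \alpha_j^*$,

one finds

$\pi_{ij}^* = F^*(\bar{\delta}_{ij}^*) \approx F^*(\delta_{ij}^*) + (\bar{\delta}_{ij}^* - \delta_{ij}^*) f^*(\delta_{ij}^*)$.

Since $F^*(\delta_{ij}^*) = \pi_{ij}$,


$\tau_{ij} \approx (\bar{\delta}_{ij}^* - \delta_{ij}^*) f^*(\delta_{ij}^*)$.

It follows that the measures of discrepancy $\tau_2$ and $\tau_3$ defined above are
weighted averages based on differences $\bar{\delta}_{ij}^* - \delta_{ij}^*$ with weights $f^*(\delta_{ij}^*)$. As a
consequence of (9), the $\bar{\delta}_{ij}^*$ have been determined in such a way as to
minimize the unweighted sum $\sum_{i,j} (\bar{\delta}_{ij}^* - \delta_{ij}^*)^2$.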

For the usual unimodal distributions, the weights $f^*(\delta_{ij}^*)$ are large for
values of $\pi_{ij}$ in the neighborhood of 1/2, small for values of $\pi_{ij}$ in the
neighborhood of 0 and 1. As a consequence, if no accurate information about f(u) is
available, it may be preferable to use a function f*(u) which has higher tails
than the normal distribution underlying the Law of Comparative Judgment,
since for such a distribution, as a rule, the weights $f^*(\delta_{ij}^*)$ do not exhibit as
large differences as for the normal distribution.

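One possibility is the double exponential distribution given by


$f(u) = \tfrac{1}{2} e^{-|u|}$,  $(-\infty < u < \infty)$
or
$F(u) = \tfrac{1}{2} e^{u}$ for $u < 0$,
$F(u) = 1 - \tfrac{1}{2} e^{-u}$ for $u > 0$.
For this distribution,
$d_{ij} = \ln 2p_{ij}$ for $p_{ij} < \tfrac{1}{2}$,
$d_{ij} = |\ln 2(1 - p_{ij})|$ for $p_{ij} \geq \tfrac{1}{2}$.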


Examples 


In [6], Mosteller uses an example by Guilford on vegetable preferences
in order to illustrate the kind of differences that arise when different functions
are used to grade the response percentages. The particular distributions
used by Mosteller are the uniform, arcsine, normal, double exponential, and
$t_{10}$-distributions. Table 1B of [6] gives estimates $a_i$ of the parameters $\alpha_i$
corresponding to these five distributions. However, the standardization is
different from the one used in this paper. In his Table 3, Mosteller illustrates
the kind of differences which can be expected between observed preference
proportions and recaptured proportions. The measure of over-all distortion
used by Mosteller is similar to our $\tau_1$. It is interesting to note that in this
example the double exponential distribution provides a considerably better
fit than the other distributions. The uniform distribution provides the worst
fit.

In accordance with the discussion of the previous section, we shall
consider two examples involving true population preference probabilities
in order to avoid sampling effects. In both cases, we use three different
response curves, the uniform, normal, and double exponential distributions.
For both examples, s = 4, with the given $\alpha_i$ proportional to −1, 0, 1/4, 3/4. In
the first example, the true distribution is assumed to be triangular, while in
the second example the t-distribution with 1 d.f. (also known as the Cauchy
distribution) is used to compute the given $\pi_{ij}$.


TABLE 1

Influence of Distribution Type Used to Grade Response Percentages
(True Distribution: Triangular)

Distribution                                        Double
type:           True      Uniform     Normal      exponential
A1             -1.36       -1.34       -1.35        -1.37
A2              0          -0.04       -0.01         0.03
A3              0.34        0.34        0.33         0.33
A4              1.02        1.04        1.03         1.01
P32             .595        .580        .593         .616
T32                        -.015       -.002         .021
P43             .680        .650        .685         .727
T43                        -.030        .005         .047
P42             .755        .730        .763         .790
T42                        -.025        .008         .035
P21             .820        .780        .825         .854
T21                        -.040        .005         .034
P31             .875        .860        .879         .888
T31                        -.015        .004         .013
P41             .955       1.000        .951         .939
T41                         .045       -.004        -.016
T1                          .045        .008         .047
T2                          .028        .005         .028
T3                          .031        .005         .030


TABLE 2

Influence of Distribution Type Used to Grade Response Percentages
(True Distribution: t-Distribution with 1 D.F.)

Distribution                                        Double
type:           True      Uniform     Normal      exponential
A1             -1.36       -1.33       -1.34        -1.36
A2              0          -0.05       -0.03        -0.00
A3              0.34        0.33        0.33         0.34
A4              1.02        1.05        1.04         1.02
P32             .578        .562        .566         .577
T32                        -.016       -.012        -.001
P43             .648        .618        .626         .641
T43                        -.030       -.022        -.007
P42             .705        .680        .687         .696
T42                        -.025       -.018        -.009
P21             .750        .710        .724         .742
T21                        -.040       -.026        -.008
P31             .785        .772        .777         .781
T31                        -.013       -.008        -.004
P41             .835        .890        .860         .843
T41                         .055        .025         .008
T1                          .055        .026         .009
T2                          .030        .018         .006
T3                          .033        .020         .007

Tables 1 and 2 give the location parameters $\alpha_i$ or $\alpha_i^*$, the preference
proportions $\pi_{ij}$ or $\pi_{ij}^*$, i > j, as well as the measures of distortion $\tau_1$, $\tau_2$, $\tau_3$.
(In the tables the letters A, P, T are used in place of α, π, τ.) The parameters
$\alpha_i$ and $\alpha_i^*$ have been standardized in such a way that
$\sum \alpha_i^2 = \sum \alpha_i^{*2} = 3$ (= s − 1).
In a sense, this standardization corresponds to the requirement that the $\alpha_i$
have unit mean square deviation.

Note that all three measures of distortion convey about the same
information. In the first example, the normal distribution provides an excellent
fit, while in the second example, the double exponential distribution provides
by far the best fit. This was to be expected, since the triangular distribution
of the first example is very similar to the normal distribution, while the
$t_1$-distribution of the second example with its high tails is closest to the
double exponential distribution. In neither case does the uniform
distribution provide a close fit.
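
The recapture procedure behind the tables can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's computation: it takes the true proportions of Table 2, grades them with the double exponential and normal distributions, averages the graded values with a zero on the diagonal, and rescales the location parameters so that their squares sum to s − 1. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def de_cdf(u):
    """Double exponential distribution function F(u)."""
    return 0.5 * math.exp(u) if u < 0 else 1.0 - 0.5 * math.exp(-u)

def de_inv(p):
    """Inverse of de_cdf: ln 2p below one half, |ln 2(1 - p)| otherwise."""
    return math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p))

def full_matrix(lower, s):
    """Build the s x s matrix of pi_ij from the entries with i > j (0-based)."""
    pi = [[0.5] * s for _ in range(s)]
    for (i, j), p in lower.items():
        pi[i][j] = p
        pi[j][i] = 1.0 - p
    return pi

def recapture(pi, cdf, inv):
    """Grade pi_ij with the assumed F*, estimate locations, recapture pi*_ij."""
    s = len(pi)
    delta = [[0.0 if i == j else inv(pi[i][j]) for j in range(s)]
             for i in range(s)]
    a = [sum(row) / s for row in delta]                 # alpha*_i
    pi_star = [[cdf(a[i] - a[j]) for j in range(s)] for i in range(s)]
    return a, pi_star

def standardize(a):
    """Rescale so that the sum of squares equals s - 1."""
    scale = math.sqrt((len(a) - 1) / sum(x * x for x in a))
    return [x * scale for x in a]

# True proportions of Table 2 (t-distribution with 1 d.f.), 0-based indices.
table2_true = {(1, 0): 0.750, (2, 0): 0.785, (3, 0): 0.835,
               (2, 1): 0.578, (3, 1): 0.705, (3, 2): 0.648}
pi = full_matrix(table2_true, 4)

a_de, pi_de = recapture(pi, de_cdf, de_inv)
normal = NormalDist()
a_n, pi_n = recapture(pi, normal.cdf, normal.inv_cdf)

for (i, j) in sorted(table2_true):
    print(f"P{i + 1}{j + 1}  true {pi[i][j]:.3f}  normal {pi_n[i][j]:.3f}"
          f"  double exp. {pi_de[i][j]:.3f}")
print("A (double exp.):", [round(x, 2) for x in standardize(a_de)])
```

Run on these inputs, the sketch gives recaptured proportions that agree with the double exponential and normal columns of Table 2 to about the third decimal, which suggests that this reading of the procedure is at least consistent with the tabled values.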

It is of some interest to compare these results with those of Mosteller’s 
example. The $\tau_{ij}$ in Mosteller’s example run somewhat higher (in absolute
value) than ours. More significantly, none of the five distributions used by 
Mosteller provides even nearly as good a fit as the best distribution in either 
one of our examples. Presumably, this is due to sampling effects, though it 
may also mean that the assumed model is not completely appropriate. 


Conclusions 


The model underlying the Law of Comparative Judgment can easily 
be generalized to permit distributions other than the normal. The examples of 
the previous section seem to suggest that no single distribution can be ex- 
pected to give a good fit under widely varying conditions. On the other hand, 
even the uniform distribution, though widely differing from the true dis- 
tributions, can hardly be said to give useless results. In our two examples, 
the maximum error in recapturing the true preference proportion is .055. 
Very often, sampling variability will be of a larger order of magnitude. It 
would seem, then, that the choice of a particular distribution to be used in 
grading responses is not all-important.


When multiple significance tests are computed, a certain number of
“significant” findings will emerge simply because of chance fluctuations.
In the present paper, some factors affecting the number of nominally
significant results are elaborated and a general method is suggested which permits
unbiased inference as to the significance of a set of findings, as a set. The
method advocated employs a high speed computer to generate empirically
a sampling distribution tailored to a particular data matrix. The method
is illustrated in the case of dichotomous response to inventory items, where
it is found that the statistical model still often used as a basis for estimation
is conservative. Some problems in the application of the method are
discussed.


One of the difficulties attending the interpretation of research results 
concerns the number of “significant” differences or relationships to be
expected upon a chance basis. The problem is most acute in the situation where
a large number of statistical comparisons is made. A number of these tests 
by virtue of sampling error may appear to be significant when in actuality 
no reliable differences exist. How many chance “significant” findings may be 
anticipated, and when do the results indicate the general presence of non- 
chance relationships? 

The answer to this question is most important, for its usual consequence 
is either to terminate or to encourage the experimenter’s continued research 
interest in his particular data-set. Because an appropriate decision at this 
juncture in the research sequence has such powerful implications for the 
course of future research, it is especially desirable that the procedure employed 
for this decision provide accurate guidance. It is the purpose of this paper to 
discuss some reasons why the decision model conventionally used is incorrect 
and to suggest a more appropriate and highly general procedure for estimating 
the presence or absence of nonchance relationships. By way of illustration 
several sets of data are analyzed, with results that are counter to presently 
widespread intuitions. 

The problem of evaluating multiple significance tests is a long-standing 
and well-known one. In the psychological literature, Wilkinson [10], Brozek
and Tiede [2], and Sakoda, Cohen, and Beall [8] several years ago described
a solution to this estimation problem. Their common rationale was to first 
presume independence among the outcomes of the several (or many) sig- 
nificance tests and then to calculate from the binomial expansion the chance 
probability of obtaining at least n significant results when given this first 
presumption. Following the appearance of this formulation of the problem, 
it has been noted [4, 7] that for many kinds of data the crucial assumption 
of independence of outcome of the statistical tests is in serious error because 
the variables or events providing the data for the statistical tests tend to be 
correlated. Here the matter has been allowed to rest, essentially without 
solution. 
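
The independence-assuming calculation described above amounts to a binomial tail probability. A minimal sketch follows; the function name and the figures of 100 tests at the .05 level are chosen only for illustration.

```python
from math import comb

def prob_at_least(k, m, alpha):
    """Chance probability of k or more significant results among m
    independent tests, each conducted at level alpha (binomial model)."""
    return sum(comb(m, r) * alpha ** r * (1 - alpha) ** (m - r)
               for r in range(k, m + 1))

m, alpha = 100, 0.05
expected = m * alpha                      # 5 significant results on average
p_ten_or_more = prob_at_least(10, m, alpha)
print(expected, round(p_ten_or_more, 4))
```

The calculation is valid only under the independence presumption, which is exactly the assumption questioned in the text.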

Notwithstanding the absence of an answer to the bedeviling problem, 
there appears to have developed among psychologists a casual consensus 
that, in general, the lack of independence of psychological variables will 
greatly increase the frequency of falsely significant findings. The reasoning 
here is that a measure achieving chance “significance” will tend to bring
toward significance also the variables with which it is correlated. For example, 
if one measure of intelligence quite fortuitously happens to relate significantly 
to the independent variable, all other intelligence-related measures will tend 
to relate to the independent variable also, providing, incidentally, a clustering 
or consistency of findings which might prove persuasive to the unwary. 

Although widely recognized as inappropriate for the kinds of data with 
which psychologists work, the independence-assuming statistical model 
continues to be employed, perhaps because some model is deemed better than 
none. When used, the interpretive convention appears to be that if multiple 
statistical tests produce significant findings in a frequency at or only modestly 
above the chance expectations issuing from the independence-assuming 
binomial model, these can be safely disregarded. Only when voluminous 
significances are achieved is, according to this viewpoint, attention and con- 
sideration of the manifest relationships warranted. The point beyond which 
the number of significant findings is considered extra-chance is set vaguely. 

The above orientation is not justified for two reasons. First, extensive 
covariation among nondiscriminating variables may mask genuinely sig- 
nificant relationships which are not redundantly measured. For example, a 
hundred measures variously reflecting, say, “adjustment” may fail to show
any relationship at all to a criterion but the single measure of “introversion”
may show itself to be a reliably significant predictor. In considering the set 
of 101 measures en masse, however, the one significant finding might be 
discounted as due to sampling variations—an unfortunate turning away 
from what might be an important research lead. While this phenomenon 
sometimes has been recognized, perhaps it has been assigned insufficient 
importance. 

The second objection is more compelling and calls attention to a
consideration which, although obvious, has apparently not previously been
noted as affecting the frequency of “significant” results. This is the fact
that certain subject samples will on certain measures manifest rather little
(reliable) variation. For example, a sample of Ph.D.’s will be characterized
by a much smaller standard deviation than will an unselected sample on
almost any measure of intelligence. Now, when the dispersion of a measure
decreases because of the homogeneity of the sample to which it is applied,
its reliability within that sample decreases also. It follows from this
observation, by reasoning akin to that involved in justifying the familiar correction for
attenuation, that the likelihood is lessened of a significant association
between an independent variable and a measure whose variance is reduced
by sample homogeneity. If many of the measures employed show rather
little dispersion because of sample homogeneities, then it becomes quite
unlikely that more than a chance number of significant findings will be
generated. Just how important this factor of reduced dispersion can be in
some rather typical research instances, we shall very shortly illustrate.
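
The step from reduced dispersion to reduced reliability can be made explicit under an assumption the text does not state, namely the classical test theory model with the same error variance in the selected and the unselected sample. Writing $\sigma$ and $\rho$ for the standard deviation and reliability in the unselected sample and $\sigma'$ and $\rho'$ for the homogeneous sample (symbols introduced here for illustration only), one would have

```latex
\rho' = 1 - \frac{\sigma^2}{\sigma'^2}\,(1 - \rho),
\qquad
r_{xy} = r_{T_x y}\,\sqrt{\rho'} ,
```

where $r_{T_x y}$ is the correlation of the independent variable with the true score of the measure. For example, with $\rho = .90$ and $\sigma' = \sigma/2$, $\rho' = 1 - 4(.10) = .60$, so that an observed correlation would be attenuated by a factor of about $\sqrt{.60} \approx .77$ instead of $\sqrt{.90} \approx .95$.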

When simultaneously there is appreciable covariation among measures 
and the measures provide relatively little variation because of sample homo- 
geneity—a condition which arises often and justifiably in psychological 
research, especially where subtle comparisons are of interest—the con- 
ventional model is clearly inapplicable. No easy, rough and ready extra- 
polation can be made as to how the covariation characteristics of the data will 
intertwine and affect the number of “significant” results. Certain data
characteristics depress the number of significant findings (and the re- 
searcher!); other properties of the data are inflationary. How these factors 
balance out is the crucial question. 

In the next section the problem is restated and developed as it applies 
to dichotomous responses, e.g., responses to personality inventory items, 
adjective check lists, and so on. The dichotomous case is used for illustrative 
purposes because the analytical problem arising from this form of data is a 
familiar one to many psychologists, and the logic of our argument and 
proposed solution is perhaps most readily perceived in this frame of reference. 
A solution is then proposed which requires the employment of a high speed 
computer to generate empirically a sampling distribution which permits 
proper evaluation of the significance of a set of findings. The solution is 
applied to actual data and then, in the final section, the solution is evaluated 
and certain problems in its application are discussed. 


The Problem in the Dichotomous-Response Situation 


To the item “I am aware of my conscience,” 95 percent of college
women will say “True.” Inventories such as the MMPI or the California
Psychological Inventory (CPI), which aim toward the identification of
psychopathology or the assessment of socially valued dimensions, include a
preponderance of items which “pull” predominantly one-sided responses in
nonpathological populations. All of this is as it should be, for the screening
purposes for which these and similar tests were designed. Typically,
covariation among inventory items abounds, and again, this is a desired
characteristic of a psychological test since we seek, within limits,
homogeneous subsets of items.

Now let us suppose that a researcher has scored a sample of subjects 
with regard to some variable of special interest to him. These subjects have 
also taken the MMPI. In order to assess the meaning of the cathected 
score, he identifies the “Highs” and “Lows” on this distribution and
item-analyzes the set of MMPI responses to identify those items which significantly
differentiate the two groups. 

It is simply a restatement of our earlier remarks on reduced dispersion 
as a function of sample homogeneity to note that as the variance of an item 
decreases, i.e., as one response alternative becomes more and more favored, 
that item becomes less and less discriminating and has less and less of an 
opportunity to relate to the independent basis of classification. The limiting 
case is when 100 percent of the subjects agree (or disagree) with an item. So 
universal an item can never emerge as discriminating the two contrasted 
groups and clearly it is nonsensical to “test” such an item as a potential
discriminator. 

Although completely one-sided response to inventory items is rare, 
approximations to universal response are very frequent indeed. For example, 
in a college population the mean frequency split for items comprising the 
CPI is 4:1 (or 1:4). Many of the CPI items are responded to in 9:1 (or 1:9) 
proportion. The same situation also characterizes MMPI responses. 
Obviously, these approximations to universal items have little power as 
discriminators and should not be judged by the same statistical criterion 
as items responded to on a 1:1 basis. 
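
How little power a nearly universal item has can be illustrated by enumerating every way its responses could divide between two groups. The sketch below is an illustration under assumptions not taken from the text: two groups of 20, a two-sided exact test on the 2 x 2 table, and hypothetical function names. For each response split it computes the probability that random grouping alone yields a result significant at the .05 level.

```python
from math import comb

def split_prob(x, n_a, n_b, k):
    """Probability that x of the k 'True' responders fall in group A
    when n_a + n_b protocols are divided at random."""
    return comb(k, x) * comb(n_a + n_b - k, n_a - x) / comb(n_a + n_b, n_a)

def exact_two_sided_p(x, n_a, n_b, k):
    """Two-sided exact p value: total probability of all splits that are
    no more probable than the observed one."""
    p_obs = split_prob(x, n_a, n_b, k)
    support = range(max(0, k - n_b), min(k, n_a) + 1)
    return sum(p for p in (split_prob(t, n_a, n_b, k) for t in support)
               if p <= p_obs * (1 + 1e-9))

def chance_rejection_rate(n_a, n_b, k, level):
    """Probability that random grouping produces p <= level for an item
    answered 'True' by k of the n_a + n_b subjects."""
    support = range(max(0, k - n_b), min(k, n_a) + 1)
    return sum(split_prob(x, n_a, n_b, k) for x in support
               if exact_two_sided_p(x, n_a, n_b, k) <= level)

rates = {k: chance_rejection_rate(20, 20, k, 0.05) for k in (20, 32, 36, 38)}
for k, rate in rates.items():
    print(f"{k}:{40 - k} split  chance rejection rate {rate:.4f}")
```

Under these assumptions an item split 36:4 or 38:2 cannot reach the .05 level however its responses fall, whereas an item split 20:20 can.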

If the researcher finds that 5 percent of the MMPI items discriminate 
between his Highs and Lows at the .05 level of significance, what is he entitled 
to say? On the one hand, item covariation may be operating to inflate his 
findings. Or, covariation, in complex conjunction with the redundancy in the 
item pool may be functioning to attenuate his results. The one-sidedness 
of response to so many of the items also is tending to reduce the likelihood 
of substantial findings. How should he weight these several factors? What 
now shall our psychologist do or say? 

The best reaction to this predicament will always be to cross-validate— 
an elementary but most satisfactory solution which tends to go unused, 
often for unworthy reasons. Only reluctantly does the psychologist abandon 
an opportunity to be original for the work and boredom of replication. In 
fairness, though, it should be noted that cross-validation is often simply not 
feasible before a fundamental research decision must be made. 

A second way of coping with the problem requires a distinction between
those significance tests conducted for hodge-podge, undelineated, a posteriori
reasons and those instituted with reference to pointed, a priori questions. In 
the latter instance, where the events or items represent considered tests of 
the experimenter’s hypotheses, the occurrence of significant items which are 
predominantly consonant with the antecedent theory is strong support for 
their nonchance nature. A difficulty with this second approach, however, is 
the implicit nature of many psychological hypotheses. It should be recognized 
too that in the early stages of problem investigation, research strategy may 
call for a “shotgun approach” in order to scan empirically for predictive 
relevance in new and strange variables. The consolidation of findings can 
come later in the course of a systematic research program. It is most important 
early in the research sequence not to overlook potential research leads.

So, although acknowledging and emphasizing the desirability of cross- 
validation and of a hypothetico-deductive approach, the course of reality 
remains frequently to pose the psychologist with a mass of data and the big 
first question, is there substance in the set of findings which have emerged? 
Phrased alternatively, can a statistically based decision be made as to the 
chance or extra-chance nature of the set of findings, qua set? 


A Suggested Solution 


There is an obvious way of developing a solution to the problem as 
stated and many investigators must have gazed wistfully upon this alternative. 
The well-recognized solution that has always existed, if only in principle 
before, involves a return to empiricism, i.e., empirical sampling because of 
the absence of an analytic solution. If, instead of constituting criterion 
groups on a meaningful basis (e.g., Highs and Lows on a sensible variable), 
criterion groups are assembled by a randomizing device, the subsequent 
comparison of these groups of actual response protocols via item analysis 
obviously will produce items “significant” only by virtue of fortuitous
fluctuations. A differentiating item, when the groups being compared have
been aggregated by a random selector, is not reproducibly significant. An
estimate is thus available of the number of items likely to emerge as significant 
when groups are constituted by a random process. 

Of course, one such estimate of chance expectancy is insufficient and 
the procedure must be repeated a large number of times—a computationally 
formidable undertaking. Each time, two criterion groups must be generated 
on a random basis and the actual responses of the groups thus assembled 
compared. The number or percentage of nominally significant findings is 
noted and the sequence repeated. These repetitive item analyses should 
continue until the sampling distribution of the statistic being empirically 
established becomes sufficiently stable to permit inference from it. It should 
be noted that the statistic being charted is the percentage of statistical tests 
reaching the P level of significance when actual response protocols are 
grouped on a random basis.

Several aspects of this logic warrant elaboration. First, it should be noted
that the response protocols of actual individuals are being employed but are 
being assigned to criterion groups on a random basis. Because real individuals 
are responding to the inventory, the method respects the variation and 
covariation complexities of the data provided by this sample of individuals. 
No assumptions are required as to the properties of the particular response 
matrix since the sampling distribution which issues from this empirical 
approach reflects appropriately the complex interacting effects of the 
particular data set. In essence, a sampling distribution tailormade for a 
particular response matrix is constructed where more usually we check to 
see whether our existing data can be fitted to one of the common distributions. 

As a corollary of this point, the empirical sampling distribution of the 
statistic permits generalizations only to the particular combination of indi- 
viduals and items from which it, in turn, was developed. For each change in 
the pool of persons or in the pool of items, a new empirical sampling dis- 
tribution must be established. This responsibility does not open upon a 
depressing vista as far as the labors of computation are concerned. More- 
over, it seems possible that as empirical distributions are developed for the 
variety of populations of individuals and of items, certain general results 
will emerge. If this eventuality does materialize, an investigator may be able 
simply to refer to a central repository for the specific empirical sampling 
distribution relevant to his research problem. 

The reader must also recognize that the proposed solution leaves un- 
solved the question which immediately follows upon the establishment of 
substance in a set of findings, namely, which subset of the set of significant 
relationships can be accepted as reliable and worthy of follow-up or interpre- 
tation, and which relationships are simply residues of chance? This is a 
separate dilemma, although no less important than the one we are considering 
here. 


The Solution Applied 


The solution advocated was completely unrealistic until the advent of 
high speed computers. It is only through such machines that the massive 
job of developing an empirical sampling distribution can be undertaken. To 
provide the data in the present illustration, an IBM 701 digital computer, 
a device capable of functioning at a rate of about 14,000 operations per 
second, was used. 

A computer program was prepared by the writer which accomplishes 
the following steps. (Recently, the original program has been rewritten into a 
more general and efficient form by Jack Neuhaus and the author.) 

1. Punched cards containing the response protocols of each individual 
in the total pool of individuals are read by the computer and the responses 
are transferred onto magnetic tape. 

2. A specified number of pseudo-random numbers are generated by the
computer and stored. Each of these random numbers serves to identify a
particular individual’s response protocol. A first set of these random numbers 
identifies those response protocols assigned to Group A; the remaining 
random numbers select the response protocols constituting Group B. Groups 
A and B are samples selected without replacement from the total pool of 
available response protocols. (Pseudo-random numbers are numbers generated 
in the course of an endlessly cyclic computation, a computation which in the 
computer can provide usable results at very high speeds. Knowing the 
nature of the cyclic process and the starting numbers (usually primes), all 
subsequent numbers are, of course, strictly determined. However, the numbers 
produced possess random characteristics. A good discussion of pseudo-random 
procedures and tests of their properties is to be found in Arthur [1] and Meyer 
[6]. Throughout the present section, when random numbers are mentioned, 
it is to be understood that pseudo-random numbers are used in practice. The 
specific recursive methods employed here are due to Johnson [3] and to the 
Computer Section of the University of California Radiation Laboratory at 
Livermore.) 

3. The response protocols designated by the random numbers are read 
from the tape on which the data have previously been stored, and are collated 
for Group A and Group B separately. 

4. Each item is examined and the frequency difference between Group A 
and Group B evaluated by a stored version of the Latscha-Finney tables 
[5] for testing exactly the significance of 2 x 2 classifications for groups 
ranging in size from 10 through 20 individuals each. Beyond N’s of 20, 
chi square corrected for continuity is calculated for each item. 

5. A running tally is kept of the number of items significant at the .10, 
.05, and .01 levels. 

6. After print-out of the percent of items significant beyond the three 
chosen levels, the program recycles to step 2, above, to begin another esti- 
mation of chance significance. Item analyses are replicated until it is felt 
that a sufficient empirical basis for a sampling distribution has been
established. 
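
The six steps above can be sketched in a present-day language. This is an illustration, not the program described in the text: a two-sided exact test computed directly from the hypergeometric distribution is swapped in for both the stored Latscha-Finney tables and the corrected chi square, Python's standard generator replaces the recursive pseudo-random methods cited, and the protocols are simulated. All names and figures are hypothetical.

```python
import random
from math import comb

def exact_two_sided_p(x, n_a, n_b, k):
    """Two-sided exact p value for x 'True' responses in group A, given
    k 'True' responses in the two groups combined."""
    total = n_a + n_b
    def prob(t):
        return comb(k, t) * comb(total - k, n_a - t) / comb(total, n_a)
    p_obs = prob(x)
    support = range(max(0, k - n_b), min(k, n_a) + 1)
    return sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-9))

def percent_significant(protocols, group_a, group_b, level):
    """Steps 4 and 5: test every item and tally those reaching the level."""
    n_items = len(protocols[0])
    hits = 0
    for item in range(n_items):
        x = sum(protocols[p][item] for p in group_a)
        k = x + sum(protocols[p][item] for p in group_b)
        if exact_two_sided_p(x, len(group_a), len(group_b), k) <= level:
            hits += 1
    return 100.0 * hits / n_items

def chance_distribution(protocols, n_a, n_b, level, cycles, seed=0):
    """Steps 2, 3 and 6: repeatedly draw two groups at random, without
    replacement, from the pool and record the percent of items significant."""
    rng = random.Random(seed)
    results = []
    for _ in range(cycles):
        chosen = rng.sample(range(len(protocols)), n_a + n_b)
        results.append(percent_significant(
            protocols, chosen[:n_a], chosen[n_a:], level))
    return results

# Simulated pool standing in for step 1: 60 protocols, 50 dichotomous items,
# each item answered "True" (1) with probability .9.
data_rng = random.Random(1)
protocols = [[1 if data_rng.random() < 0.9 else 0 for _ in range(50)]
             for _ in range(60)]

chance = chance_distribution(protocols, n_a=20, n_b=20, level=0.05, cycles=200)
mean_percent = sum(chance) / len(chance)

# A hypothetical observed result is then referred to the empirical distribution.
observed_percent = 6.0
tail = sum(c >= observed_percent for c in chance) / len(chance)
print(round(mean_percent, 2), tail)
```

The final two lines show one way the resulting distribution might be used: the proportion of random groupings that yield at least the observed percentage serves as the chance probability of the set of findings, taken as a set.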

Each cycle, which typically would require a clerk a week or more, takes
the computer about 8-20 seconds, depending primarily on the number of
protocols in the total pool. With faster computers than the presently
outmoded 701, the time required can be reduced substantially.

In Table 1 are presented the summary results of eight applications of
this program to various kinds of psychological inventories requiring
dichotomous response. Four of these applications involve the California
Psychological Inventory (472 items) with data being provided by rather different
samples: college men (N of 48), Air Force officers (N of 256), prison inmates
(N of 80; for this sample only, only the first 70 CPI responses were analyzed)
and an arbitrarily aggregated sample (N of 335) consisting of 144 college
women, 146 college men, and 45 research scientists. A fifth application
employs the responses of 80 college women to the Inner Life Inventory
(ILI), an experimental collection of 288 questionnaire items of somewhat
unusual nature. The sixth and seventh examples are based upon the responses
of 80 males and 80 females to the 400 drawings in the Welsh
Figure-Preference Test (WFPT). The eighth illustration uses the self descriptions of 34
college men as expressed by their checks of a set of 200 adjectives.

Although these results were developed primarily by way of illustration, 
the findings have occasioned some surprise. It is perhaps still too early for a 
strong generalization to be offered, even though a trend is in evidence. For 
the several different samples of individuals and for the several different data 
sources employed, many fewer items emerge as significant than previous 
convention has presumed. Moreover, the conservative error has not been a 
slight one—factors of 2 or 3 or even 4 can be involved. 


Discussion 


Discreteness of probabilities as a partial explanation. An earlier version of 
this paper elicited from both John Tukey and an editor of this journal the 
perception that the results of the empirical sampling procedure were being 
contrasted in a faulty albeit conventional manner against inappropriate 
levels of significance. That is, when employing the exact test of significance 
for 2 X 2 tables, discrete probabilities result. These p values tend to be 
significant beyond a chosen level rather than at that level and so for this 
reason alone, something less than .10 (or .05 or .01) of the findings properly 
should occur. 

This effect is an important one and needs to be recognized but it cannot 
account completely for the present set of findings. Reference to the exact 
probabilities for the 20 versus 20 comparison [cf. 5] indicates that the 
discreteness effect accounts for perhaps half or less of the discrepancy between 
our empirical results and significance levels which presume continuity. 
Reference to the fourth analysis listed in Table 1 also supports the contention 
that discreteness is an insufficient explanation. In the fourth example, samples 
of 100 individuals were compared and with samples of this size, the 
discreteness effect tends to vanish. Yet an appreciable divergence remains between 
the nominal level of significance and the frequency with which results do 
achieve this significance level. 

A rational solution instead of an empirical estimation? Tukey and a 
Psychometrika editor have each proposed, in correspondence, a theoretical 
model that promises to account in large part for the results obtained in the 
empirical 2 X 2 analyses. Tukey’s formulation is more fully developed and 
accordingly, his explanation is summarized here. 

Given knowledge of an item’s frequency split in a large population and 
given the assumption of a binomial distribution of frequency splits for that 
item in a smaller sample, the item’s probability of significance in the smaller 
sample may be estimated. Table 2 presents Tukey’s calculations for an item 
with a true 90-10 split when a sample of 40 individuals (20 versus 20) is 
drawn. It will be seen that splits as extreme as, or more extreme than, 36-4 will arise 
with a frequency of .63 and yet none of these splits can reject the null 
hypothesis of no difference at the .10 level between the two groups in response to 
the item. When allowance is made for the many conditional probabilities of 
zero, the corrected chance of “significance at the .10 level” proves to be 
about .017, a correction by a factor of 6. 
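
The logic of this weighting can be retraced with a short computation. The sketch below is an illustration under stated assumptions, not Tukey's own procedure: it assumes the two-sided exact test for the 20 versus 20 table (the text does not say how the conditional probabilities of Table 2 were obtained, so individual entries need not agree), and the function names are hypothetical.

```python
from math import comb

def conditional_significance(k, m=20, n=20, alpha=0.10):
    """Chance that the 2 x 2 table is significant at `alpha`, given that k of
    the m + n sampled people fall on the minority side of the item."""
    total = comb(m + n, k)
    xs = range(max(0, k - n), min(m, k) + 1)      # minority count in group 1
    prob = {x: comb(m, x) * comb(n, k - x) / total for x in xs}
    chance = 0.0
    for x in xs:
        # two-sided exact p: all tables no more probable than this one
        p_value = sum(p for p in prob.values() if p <= prob[x] * (1 + 1e-9))
        if p_value <= alpha:
            chance += prob[x]
    return chance

def chance_of_significance(split=0.10, m=20, n=20, alpha=0.10):
    """Weight each conditional chance by the binomial probability of the split."""
    size = m + n
    return sum(comb(size, k) * split ** k * (1 - split) ** (size - k)
               * conditional_significance(k, m, n, alpha)
               for k in range(size + 1))

overall = chance_of_significance()    # item with a true 90-10 split
```

Under these assumptions a 36-4 split has a conditional chance of exactly zero, and the weighted total falls well below the nominal .10, in line with the order of magnitude reported above.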

It seems clear that the empirical results we have reported will conform 
rather more closely to the refined expectation the Tukey explanation can 
generate. Knowing the true distribution of item splits, it is possible, by 
weighting, to calculate an over-all estimate of chance significance that will 
allow properly for the insensitivity of items responded to one-sidedly. How- 
ever, the effect of item covariation, which operates to increase the variance 
of the refined chance significance statistic, is not yet attended to by this 
model and so its degree of accuracy cannot be finally estimated. 

TABLE 2 


Calculation of the Chance of "Significance at a Two-sided Ten Percent Level" 
for an Item with a 90-10 Split in a Large Population 

  Split in      Probability     Conditional prob. of       Partial
  sample of     of splits       "significance" in a        probability
  40            arising*        20 x 20 = 40 table         of significance

  40,0            .015               0.00%                   0.000%
  39,1            .065               0.00%                   0.000%
  38,2            .141               0.00%                   0.000%
  37,3            .202               0.00%                   0.000%
  36,4            .206               0.00%                   0.000%
  35,5            .164               4.72%                   0.774%
  34,6            .107               2.02%                   0.216%
  33,7            .058               9.14%                   0.530%
  32,8            .026               2.34%                   0.061%
  31,9            .010               3.26%                   0.033%
  30,10           .004               6.40%                   0.026%
  beyond          .002            < 10.00%                 < 0.020%
                                                           ---------
                                                           < 1.66%

* Approximate 


Despite the possibility of a rational solution and its aesthetic appeal, 
for many applications the construction of empirical sampling distributions 
may still be the method of choice because it is operational, automatic, and 
nonassumptional; it faithfully reflects all known and unknown characteristics 
of a data matrix and it is easily generalizable to other multiple-comparison 
situations. Any rational solution would have to compute or presume large 
numbers of parameter values in order to derive fairly an over-all estimate of 
chance significance. The magnitude of this computational task is probably 
of the same order as that involved in developing a sufficient empirical distri- 
bution. To the extent simplifying assumptions are introduced, the theoretically 
based estimate may go awry. Moreover, the relevant population may be 
difficult to define or if definable, may be unavailable. The effort to construct 
increasingly appropriate and more practical theoretical models is not to be 
discouraged; we simply note that in the meantime the empirical sampling 
approach by means of high speed computers can respond to the questions 
raised by researchers. 

A precise solution instead of an approximate one? Earlier in the paper 
when proposing a computer solution to the problem as stated, the logic of the 
solution was not stated precisely. Actually, the random generation of groups 
to be compared represents an expedient, not an exact solution. Strictly, the 
large but finite number of possible groupings of sizes m and n should be 
evaluated and the selected statistic—the percentage of outcomes achieving 
a designated significance level—recorded for each of the possible comparisons. 
In this way, a discrete and exact probability distribution may be developed. 

However, although finite, the number of possible groupings becomes 
impractically high very rapidly. For example, in the situation of contrasting 
two groups of 20 individuals each, selected from a total pool of 40, there are 
about 1.4 × 10^11 contrast possibilities. Even for a high speed computer, this 
number is astronomical. The introduction of a random basis for constituting 
groups is a concession to this reality. The sampling distribution thus provided 
is approximate but without bias. 
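
The size of the complete enumeration, and the random shortcut that replaces it, can be checked directly. The lines below are a small illustration; the seed and variable names are arbitrary choices for the example.

```python
import random
from math import comb

# Number of ways of choosing the first group of 20 from a pool of 40.
n_groupings = comb(40, 20)
print(f"{n_groupings:,}")            # 137,846,528,820, i.e. about 1.4 x 10^11

# One randomly constituted contrast, in place of complete enumeration.
rng = random.Random(0)
pool = list(range(40))
rng.shuffle(pool)
group_1, group_2 = sorted(pool[:20]), sorted(pool[20:])
```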

In this connection, it is relevant to mention a memorandum by Tukey 
[9] brought to the writer’s attention after a first draft of the present paper 
had been written. In the memorandum, Tukey considers the problem of 
comparing two small samples on many items. In his own analysis of the 
present statistical decision problem, Tukey lists complete randomization 
(the subdivision of the relevant samples into all possible subdivisions of size 
m and n) as perhaps the best, albeit impractical, alternative. It was his 
judgment that this method is “entirely valid and probably quite stringent. 
Its practical disadvantage comes ... [from] the amount of labor involved ...” 
([9], p. 18). In the present proposal, the essence of the complete 
randomization approach has been realized, in part by the stratagem of sampling from 
the finite but very large array of potential comparisons. It is the present 
state of technology, however, which provides the resource that permits the 
solution: the introduction of a computer to perform the prodigious amount 
of clerical work that is still required. 

Generalization of the suggested approach. We have illustrated the problem 
only as it develops with multiple 2 X 2 contingency tests (two criterion 
groups, dichotomous responses). There is also the situation where two groups 
are contrasted in regard to a large set of continuously scaled variables. Here 
t-ratios can be repetitively computed, again with groups randomly generated. 
The empirical sampling distribution of the percentage of nominally significant 
t-ratios at a specified significance level can thus be established as a basis for 
future induction. 
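
For continuously scaled variables, one replication of this procedure might look as follows. The sketch assumes the pooled-variance t-ratio and a critical value `t_crit` read from a t table for m + n - 2 degrees of freedom; both the function names and these choices are assumptions of the example.

```python
import random
from statistics import mean, variance

def t_ratio(x, y):
    """Pooled-variance t-ratio for two independent groups."""
    m, n = len(x), len(y)
    pooled = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    return (mean(x) - mean(y)) / (pooled * (1 / m + 1 / n)) ** 0.5

def fraction_nominally_significant(scores, m, n, t_crit, rng):
    """Draw two random groups and return the fraction of variables whose
    t-ratio reaches `t_crit` in absolute value.

    `scores` is a list of persons, each a list of variable scores.
    """
    people = rng.sample(range(len(scores)), m + n)
    g1, g2 = people[:m], people[m:]
    n_vars = len(scores[0])
    hits = 0
    for v in range(n_vars):
        t = t_ratio([scores[p][v] for p in g1], [scores[p][v] for p in g2])
        hits += abs(t) >= t_crit
    return hits / n_vars
```

Calling the second function repeatedly with the same random generator yields the empirical sampling distribution of the percentage of nominally significant t-ratios.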

To estimate the number of chance significant correlations with an inde- 
pendent variable—the case where both the criterion variable and the 
dependent variables are continuous—the same principle in a more complicated 
form applies. The score matrix of the dependent variables is left inviolate, 
thus maintaining its “natural’’ variation and covariation characteristics. 
The scores on the independent variable should then be randomly reassigned 
to the set of individuals. Because the same scores are used, the variance of 
the independent variable remains unchanged. Correlation of this randomly 
generated independent variable with the undisturbed set of dependent 
variables provides one estimate of the number of correlations likely to emerge 
through chance. Repeating this process of randomly reassigning scores on 
the independent variable will produce an empirical sampling distribution 
from which inference may be made as to the chance number of significant 
correlations. 
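
The correlational case may be sketched in the same spirit. In the illustration below the dependent-variable matrix is never altered; only the independent-variable scores are reshuffled among individuals. The critical value `r_crit` is assumed to be supplied from a table for the sample size in question, and all names are hypothetical.

```python
import random
from statistics import mean

def pearson_r(x, y):
    """Product-moment correlation of two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def chance_significant_counts(dependent, independent, r_crit, cycles, seed=0):
    """For each cycle, randomly reassign the independent-variable scores and
    count the dependent variables correlating at least `r_crit` with them.

    `dependent` is a list of persons, each a list of dependent-variable scores.
    """
    rng = random.Random(seed)
    shuffled = list(independent)       # same scores, hence same variance
    columns = list(zip(*dependent))    # one tuple of scores per variable
    counts = []
    for _ in range(cycles):
        rng.shuffle(shuffled)
        counts.append(sum(abs(pearson_r(col, shuffled)) >= r_crit
                          for col in columns))
    return counts
```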


NONLINEAR FACTORS IN TWO DIMENSIONS 


Intercorrelations among tests nonlinearly related to underlying dimensions 
require more linear factors than content would demand. For the case 
of two independent underlying content dimensions, a fictitious example is 
constructed and made to yield a transformation useful for the nonlinear 
analysis of certain empirical data. That transformation, when applied to a 
standard factorization (centroid or principal components if certain sym- 
metries obtain) of the appropriate empirical correlations, yields parameters 
descriptive of plausible nonplanar regression surfaces for tests on the two 
underlying dimensions. An empirical example is presented and discussed. 


The dilemma of difficulty factors has beset factor analysts for many 
years [cf. 1, 2, 3, 8, and 12]. When a group of tests quite homogeneous as to 
content but varying widely in difficulty is subjected to factor analysis, the 
result is that more factors than content would demand are required to reduce 
the residuals to a random pattern. This effect is generally attributed to 
curvilinear relations among the tests, such curvilinearities being forced by 
the differential difficulty of the tests. Coefficients of linear correlation, when 
applied to such data, will naturally underestimate the degree of nonlinear 
functional relation that exists between such tests. Implicitly, then, it is non- 
linear relations among tests that lead to difficulty factors. Explicitly, however, 
the factor model rules out, in its fundamental linear postulate ([11], p. 68), 
only such curvilinear relations as may exist between tests and factors. 

Some recent developments [4, 5] in the theory of latent profile analysis 
(the extension of Lazarsfeld’s latent structure analysis [10] to the study of 
interrelations among quantitative variables) have shed new light on the 
problem of difficulty factors. For the case of a single underlying dimension, 
there now exist nonlinear solutions which account for the intercorrelations 
as they stand. No attempt is made to obliterate the tell-tale difficulty pattern 
in the correlations by the choice of coefficient or by “corrections” or 
“adjustments” of any kind. The solutions account for the intercorrelations with 
precisely the same discrepancies as does the centroid factorization, for they 
are linear transformations of that factor matrix. They describe nonlinear 
regressions of tests on the single underlying dimension in a way that conforms 


*The opinions expressed are those of the author and are not to be construed as 
reflecting official Department of the Army policy. 
with commonsense expectations for easy and difficult tests. Regressions of 
easy tests on the underlying dimension are concave downward. The reverse 
obtains for difficult tests. 

The purpose of the present paper is to provide a similar kind of analysis 
for a set of tests that are nonlinearly related to two statistically independent 
underlying dimensions. This will be done by developing an idealized fictitious 
example to a point where it provides a transformation that can be used 
empirically. The application of that transformation to a standard factori- 
zation (the centroid or principal axes if certain symmetries obtain) of an 
appropriate set of empirical correlations results in the desired nonlinear 
solution. Such an application will be illustrated on empirical data provided 
by Dingman [2]. 
FIGURE 1 
Theoretical Partitioning of a Normal Bivariate Surface for Uncorrelated X and Y 


Consider, then, the correlation surface represented by Figure 1. X and 
Y are two hypothetical statistically independent underlying dimensions. 
The correlation surface is normal bivariate, and the dotted lines divide it 
into nine sectors in such a way that every vertical and horizontal array has 
frequencies in the ratio 1:2:1. In other words, the standard score equations 
for the vertical and horizontal dotted lines are, respectively, X = ± .6745 
and Y = ± .6745. The dashed lines divide the space into five regions, one 
of them coinciding with the middle sector, and each of the other four consisting 
of a side sector and half of each adjacent corner sector. These five regions are 
lettered A through E, and their descriptions and relative frequencies are as 
follows. 


    A    Low X, average Y         3/16 
    B    High X, average Y        3/16 
    C    Average X, average Y     4/16 
    D    Average X, high Y        3/16 
    E    Average X, low Y         3/16 


The people in each of these regions will henceforth be referred to collectively 
as a latent class. 
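
The relative frequencies of the five classes follow from the 1:2:1 partition and the independence of X and Y. The short computation below retraces that arithmetic; the variable names are arbitrary.

```python
from fractions import Fraction
from statistics import NormalDist

# The dotted lines sit at the quartiles of the unit normal curve.
cut = NormalDist().inv_cdf(0.75)                 # about 0.6745

# Marginal proportions low : average : high = 1 : 2 : 1 on each dimension.
marginal = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
sector = [[px * py for py in marginal] for px in marginal]   # independence

corner = sector[0][0]     # each corner sector holds 1/16
side = sector[0][1]       # each side sector holds 2/16
middle = sector[1][1]     # the middle sector holds 4/16

# Each outer class: one side sector plus half of each adjacent corner sector.
outer_class = side + 2 * (corner / 2)
class_sizes = {"A": outer_class, "B": outer_class, "C": middle,
               "D": outer_class, "E": outer_class}
```

The four outer classes each come to 3/16, the middle class to 4/16, and the five together exhaust the surface.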

Imagine, next, a set of six tests so constructed that all members of a 
latent class earn the same score on each test, though the various classes may 
earn different scores on the same test, and the same class may earn different 
scores on different tests. Tests 1-3 constitute a set of tests of dimension X, 
test 1 being easy, test 2 being of medium difficulty, and test 3 being hard. 
Tests 4-6 are easy, medium, and hard tests, respectively, of dimension Y. 
Within each of these two sets of tests we will expect to find the usual difficulty 
pattern of intercorrelations—the greater the disparity in difficulty, the lower 
the correlation. The between-set correlations should all be zero, since the two 
underlying dimensions are independent. 

These six fictitious tests may be described in terms of their surfaces of 
regression on the two underlying dimensions. This is perhaps best visualized 
by imagining the surfaces erected over Figure 1. Tests 2 and 5, of medium 
difficulty, should have planar regression surfaces, the plane for test 2 tilting 
upward from left to right along axis X and the plane for test 5 tilting upward 
from the negative to the positive end of axis Y. The regression surfaces for 
the other four tests should be curved as well as tilted, the curvature and tilt 
being along the appropriate axis, and the concavity being downward or up- 
ward according as the test is easy or hard. For example, the regression surface 
for test 1 should, in progressing from left to right along axis X, first climb 
steeply and then level off, since test 1 is an easy test of dimension X. 
Analogously test 6, a hard test of dimension Y, should be concave upward 
with its curvature and tilt being along axis Y. 

A more precise expression of the foregoing verbal descriptions of the 
various regression surfaces is to be found in Table 1. There rows 1-6 of the 
matrix L’ give, in standard score form, the thirty class averages of the six 
fictitious tests in the five latent classes. Each of these rows defines the surface 
of regression of the corresponding test on dimensions X and Y. The numerical 
entries in the row describe the regression surface in terms of five vertical 
coordinates, one for each latent class, that may be imagined as erected upon 
Figure 1. For concreteness, let each vertical coordinate be imagined as erected 
upon the X-Y centroid of its class—the intersection of the two mean lines 
(X and Y) for the class. Then these five coordinate points lie in a tilted plane 
for tests 2 and 5; for the other four tests they describe curved surfaces. For 
tests 1 and 3, the main curvature is along axis X, while for tests 4 and 6, it 
is along axis Y. 

A secondary type of curvature has appeared in the regression surfaces 
for tests 1, 3, 4, and 6. Up to now these surfaces have been considered as if 
they were cylinders, a cylinder being defined mathematically as the surface 
swept out by a line moving through space in any path, but always parallel 
to its original position. Now consider, for example, the five-point regression 
surface for test 1, and in particular, the test-1 means for classes D and E— 
the two classes that are lateral to axis X, the underlying dimension for test 1. 
These means are lower than the class-C mean for Test 1. It is as though an 
initially cylindrical regression surface had shrunk and curled inward (i.e., 
in the same direction as its major concavity) along its edges. The very same 
secondary curvature appears in the other three nonplanar regression surfaces. 
This secondary curvature is attributable to the requirement, mentioned 
earlier, that all between-set correlations be zero. An alternative fictitious 
model, having purely cylindrical regression surfaces, is quite simple to con- 
struct. In it test 1, for example, would have a low mean on class A but high 
and equal means in the other four classes. However this could only be at the 
expense of some fairly substantial nonvanishing between-set correlations. 
The model that preserves complete between-set independence seems pre- 
ferable, for the secondary curvature is minor compared to the principal 
curvature. 

The matrix L’ in Table 1 is called a latent profile matrix because the 
columns give, for each latent class, its profile of standard scores on the tests. 
These latent profiles correspond to the verbal descriptions of the latent 
classes that were previously given. For example, latent class A is, on the 
average, low on tests of dimension X and at the mean on tests of dimension 
Y. Row 0 of L’ has a purpose that will soon become apparent. Row V at the 
bottom of Table 1 shows again, in decimal form, the relative sizes of the five 
latent classes. 

To obtain the matrix of intercorrelations, R, among the six tests when 
the underlying structure is as given in Table 1, let a 5 × 5 diagonal matrix 
V be formed so as to contain the five relative class sizes in its diagonal cells. 
Then the matrix equation for R is simply 

$$R = L'VL, \tag{1}$$

since all members of a latent class earn the same standard score on each test, 
and since a correlation coefficient is simply the average product of standard 
scores for the two tests involved. The condition that all members of a class 
have identical score profiles amounts to restricting all within-class test 
variances and covariances to zero. In empirical contexts where (1) is 
considered to hold, it turns out, fortunately, that there can be considerable 
variation among test scores within latent classes, though zero within-class 
covariances must obtain as a matter of definition (cf. [5] and later discussion 
in the present paper). 

The matrix R for the present fictitious example is shown in Table 2. 
It is bordered at top and left by row and column 0, containing the six standard 
score grand means of zero and, in the diagonal cell, the sum of the five rela- 
tive class sizes. The last six diagonal cells in R (they would have been unity 
but for rounding) indicate that proportion of the test variances that is 
attributable to between-classes variation. The pattern of magnitudes of 
the correlations in Table 2 should be noted. Both groups of within-set correl- 
ations show the expected difficulty pattern of decreasing correlations as the 
difference in difficulty increases. Between the two sets of variables the correl- 
ations are all zero. 
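
The computation in (1) is easily illustrated on a reduced scale. The sketch below uses three hypothetical tests whose raw class means are invented for the purpose (they are not the entries of Table 1); each row is standardized with the class sizes as weights, and R is then formed element by element.

```python
from math import sqrt

V = [3 / 16, 3 / 16, 4 / 16, 3 / 16, 3 / 16]    # relative sizes of classes A..E

def standardize(class_means):
    """Convert raw class means to standard scores, weighting by class size."""
    mu = sum(v * x for v, x in zip(V, class_means))
    sd = sqrt(sum(v * (x - mu) ** 2 for v, x in zip(V, class_means)))
    return [(x - mu) / sd for x in class_means]

# Hypothetical raw class means in the order A, B, C, D, E (illustration only).
L_prime = [
    standardize([-2.0, 1.0, 1.0, 0.5, 0.5]),    # an "easy" test of X
    standardize([-1.0, 1.0, 0.0, 0.0, 0.0]),    # a "medium" test of X
    standardize([0.0, 0.0, 0.0, 1.0, -1.0]),    # a "medium" test of Y
]

def correlations(L_prime, V):
    """R = L'VL: weighted average products of standard class means."""
    return [[sum(v * a * b for v, a, b in zip(V, row_i, row_j))
             for row_j in L_prime] for row_i in L_prime]

R = correlations(L_prime, V)
```

In this toy case the diagonal entries are unity, the two tests of X correlate positively but less than perfectly, and both correlate zero with the test of Y, which is the qualitative pattern described for Table 2.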

Table 3 shows a standard factorization, $F_c$, of the R of Table 2. The 
first column of $F_c$ is obtained by diagonal factoring pivoted upon the vector 0. 
The last four entries in Row 0 of $F_c$ thereby become zeros. The remaining 
part of $F_c$ is the centroid and principal components factorization of the 
unbordered correlation matrix. The pattern of magnitudes and of signs in 
$F_c$ should be carefully noted. It is only after four centroid factors are 
extracted that the residual correlations vanish except for rounding discrepancies. 

Now define 

$$F_t = L'V^{1/2}, \tag{2}$$

where $V^{1/2}$ is diagonal and contains the square roots of the relative class 
sizes in its diagonal cells. $F_t$ may be called, for present purposes, the “true” 
factorization of R. Since $F_c$ and $F_t$ are orthogonal factorizations of R, there 
exists an orthogonal transformation, $A_{ct}$, such that 

$$F_c A_{ct} = F_t. \tag{3}$$

The least squares solution for $A_{ct}$ in (3) is 

$$A_{ct} = (F_c'F_c)^{-1}F_c'F_t, \tag{4}$$

which is readily obtained in case $F_c$ is a principal components factorization, 
for then $F_c'F_c$ is diagonal and readily inverted. The $A_{ct}$ for this example is 
displayed in Table 4. 

A more useful transformation than $A_{ct}$ is $A_{ct}V^{-1/2}$, for it yields L’ 
directly when applied to $F_c$. Thus, by virtue of (3) and (2), 

$$F_c A_{ct} V^{-1/2} = F_t V^{-1/2} = L'V^{1/2}V^{-1/2} = L'. \tag{5}$$

It further will turn out that the important rows in L’ (all but row 0) can 
be obtained by using only the last four columns in $F_c$ and the last four rows in 
$A_{ct}V^{-1/2}$. Thus the bordering of the correlations by row and column 0 is a 
computationally trivial operation and need not be done in practical 
applications that make use of this particular transformation. It is sufficient to 
factor the correlations themselves and apply the 4 × 5 linear transformation 
to the resulting 4-column factor matrix. Let this 4 × 5 transformation, 
consisting of the last four rows of $A_{ct}V^{-1/2}$, be designated T. It happens that all 
entries in T contain the factor $1/\sqrt{3}$, so that a compact way of displaying 
the desired transformation is to show $\sqrt{3}\,T$. It is shown in Table 5. The row 
headings in Table 5 now designate the four centroid factors. 
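
The least squares solution for the transformation carrying a standard factorization into the "true" one can be illustrated on a small scale. The sketch below covers only the convenient case noted above, in which the cross-product matrix of the standard factorization is diagonal. The two matrices are hypothetical (a 3-test, 2-factor toy example, not Tables 3 and 4), and the helper names are assumptions of the example.

```python
from math import cos, sin

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def least_squares_transformation(F_std, F_true):
    """A = (F'F)^{-1} F'G for F = F_std, G = F_true, assuming F'F is diagonal,
    so that the inverse reduces to dividing each row by a diagonal entry."""
    F_t = transpose(F_std)
    gram = matmul(F_t, F_std)          # diagonal by assumption
    cross = matmul(F_t, F_true)
    return [[cross[i][j] / gram[i][i] for j in range(len(cross[0]))]
            for i in range(len(cross))]

# Hypothetical standard factorization with mutually orthogonal columns.
F_std = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.0]]
theta = 0.3
A_true = [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]]   # orthogonal
F_true = matmul(F_std, A_true)

A_hat = least_squares_transformation(F_std, F_true)   # recovers A_true
```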

Now consider the empirical correlations shown in Table 6. They are 
taken from Dingman ([2], p. 25) and are based upon a sample of 479 college 
students. The tests involved in Table 6 were constructed by Dingman from 
items in the Guilford-Zimmerman Tests of Spatial Visualization and Verbal 
Comprehension [9]. Tests 1, 2, and 3 were made up of Spatial Visualization 
items of low, medium, and high difficulty, respectively. Tests 4, 5, and 6 
were made up of Verbal Comprehension items with the same gradation of 
difficulty. Tests 1-5 each consist of ten three-alternative multiple choice 
items. Test 6 is made up of nine items of the same type. Kuder-Richardson 
Formula 21 reliabilities appear in the diagonals of R, and average item 
difficulties p, in terms of proportion passing according to Dingman ([2], p. 
22), are shown across the bottom in Table 6. 

The pattern of magnitudes of the correlations in Table 6 is to be 
compared with that of the fictitious correlations in Table 2. The within-set 


TABLE 6 


Empirical Correlations, R (with KR-21 Reliabilities in Diagonals), 
and Average Item Difficulties, p, for Dingman Example 

                                                      R
No.  Test                               1      2      3      4      5      6 
1    Easy Spatial Visualization        .83    .62    .39    .11    .04   -.08 
2    Medium Spatial Visualization      .62    [?]    .54    .07    .00   -.09 
3    Hard Spatial Visualization        .39    .54    .67    .00   -.05   -.10 
4    Easy Verbal Comprehension         .11    .07    .00    .63    .45    .33 
5    Medium Verbal Comprehension       .04    .00   -.05    .45    .71    .61 
6    Hard Verbal Comprehension        -.08   -.09   -.10    .33    .61    .73 
TABLE 7 

Centroid Factorization, $F_c$, for Dingman Example 

        I      II     III     IV 
1      .60    .52    .27   -.32 
2      .60    .57    .05    .06 
3      .46    .52   -.31    .28 
4      [?]   -.41    .34    .27 
5      [?]    [?]    [?]    [?] 
6      .44   -.62   -.25   -.19 


TABLE 8 

Fourth-Factor Residuals for Dingman Example 

        1      2      3      4      5      6 
1      .03   -.03    .01    .02   -.01   -.01 
2     -.03    .07   -.04   -.04    .01    .02 
3      .01   -.04    .01    .01   -.01   -.01 
4      .02   -.04    .01    .02   -.01   -.01 
5     -.01    .01   -.01   -.01    .05   -.04 
6     -.01    .02   -.01   -.01   -.04    .06 

correlations show similar difficulty patterns, though at a lower level, and the 
between-set correlations are low and fairly randomly distributed around 
zero. 

Table 7 shows a centroid factorization of the correlations in Table 6. 
The Kuder-Richardson reliabilities were used as the diagonals of the corre- 
lation matrix for this factoring. The fourth-factor residuals, including com- 
puted residual diagonals, are shown in Table 8. Were it not for the difficulty 
pattern in the correlations of Table 6, the second-factor residuals might have 
reached this level even with reliabilities in the diagonals. Except for the 
differences in difficulty, all within-set correlations would have been, essen- 
tially, alternate-forms reliabilities, so that the extraction of one factor for 
each set would probably have exhausted the ‘‘real’’ part of all intercorrelations. 

The pattern of loadings in Table 7 is to be compared with that of the 
fictitious loadings in Table 3. All four centroids are in close agreement with 
regard to pattern of signs and of magnitudes. The argument that the last 
one or two Dingman centroids are residual is defeated not only by the size 
of the loadings therein, but also by their predictable pattern. A smaller sample 
or shorter tests might possibly have obscured that pattern. 

It is at this point that an extra step in the computations will be ex- 
hibited, in the present case more for the purpose of illustrating the possible 
need for such a step than out of necessity. We have already seen good corre- 
spondence between the centroids of the fictitious and of the Dingman data. 
That correspondence is due to the symmetric representation of easy, medium, 
and hard tests of each of the two abilities in the Dingman data. Had there 
been, let us say, a disproportionate representation of easy tests, then the 
correspondence would not have been nearly as great, and some amount of 
adjustment, in the form of an orthogonal rotation, would probably have been 
mandatory before complete one-to-one correspondence between fictitious 
and empirical centroid factors became evident. Such adjustments could be 
made by analytical means [cf. 7], in which an orthogonal transformation is 
sought such that the empirical centroid factorization is carried into maximum 
conformity with the fictitious one. With the Dingman data only slight 
adjustments appeared possible. These were made by graphical rotations, in 
which the aim was to have the plots of the rotated empirical factors show the 
same symmetric arrangement of points as do the plots of the fictitious 
centroid factors. In this process, somewhat more attention was given to the 
points representing the easy and hard tests, with the tests of medium difficulty 
receiving less consideration. 

The resulting orthogonal transformation X is given in Table 9. Because 
of the essentially diagonal form of that transformation, little improvement 
was possible with the Dingman data. In other empirical applications this 
transformation might be less nearly diagonal, and might adjust, for example, 
to a reversal in the order of extraction of the last two centroids, or to the need 
for reflecting one or more of the centroid axes. 

The adjusted factorization $F_a$ for the Dingman data is given by the 
equation 

$$F_a = F_c X. \tag{6}$$

It appears in Table 10. Now we post-multiply $F_a$ by T to obtain a latent 
profile matrix L’. Thus, 

$$L' = F_a T. \tag{7}$$

TABLE 9 

Adjustive Orthogonal Transformation, X, for Dingman Example 

         I      II     III     IV 
I      1.00    .00    .06   -.05 
II      .00    .99    .11    .13 
III    -.06   -.12    .99    .04 
IV      .04   -.12   -.05    .99 


TABLE 10 

Adjusted Factorization, $F_a = F_c X$, for Dingman Example 

        I      II     III     IV 
1      .57    .52    .38   -.27 
2      .60    .55    .15    .11 
3      .49    .52   -.24    .31 
4      [?]    [?]    .31    .20 
5      .56   -.55   -.12   -.23 
6      .45   -.56   -.28   -.30 





TABLE 11 


Latent Profile Matrix, $L' = F_a T$, for Dingman Example 

No.  Test                                A       B       C       D       E 
1    Easy Spatial Visualization        -1.78     .72     .66     .15     .03 
2    Medium Spatial Visualization      -1.28    1.36     .26    -.16    -.27 
3    Hard Spatial Visualization         -.67    1.66    -.42    -.25    -.18 
4    Easy Verbal Comprehension           .04     .06     .54     .71   -1.53 
5    Medium Verbal Comprehension        -.21    -.18    -.21    1.61    -.94 
6    Hard Verbal Comprehension          -.06    -.31    -.48    1.67    -.66 
The resulting L’ for the Dingman data is shown in Table 11. The relative 
class sizes V here are the same as they were in the fictitious example. Though 
having no row 0, the L’ of Table 11, along with the diagonal V, reproduces 
Dingman’s correlations according to (1), with discrepancies equal to the 
fourth centroid residuals of Table 8. In this reproduced R the diagonal terms 
$r_{ii}$ indicate the proportion of each test’s variance that is attributable to 
between-classes variation. Their complements $(1 - r_{ii})$ therefore indicate 
the proportion of within-class variance. These complements average slightly 
over .30, so that there is considerable variation of test scores within the 
latent classes. There can be no covariation of test scores within classes, 
however, because (1) holds for empirical data only when the latent classes are 
defined as having zero within-class covariances [5]. It is only when all of its 
within-class covariances vanish that a latent class contributes to 
across-class covariation as if all of its members had the same score on each variable. 
The theoretical justification for such a definition is that any group that is 
homogeneous with respect to all underlying dimensions will have had, ipso 
facto, all of its internal covariation partialled out. 
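
The reproduction of the correlations from the latent profiles can be retraced numerically. In the sketch below the profile entries are as read from Table 11; several digits are hard to read in the printed table and are filled in here on the assumption that each row has a weighted mean of zero, so the entries should be treated as approximate.

```python
V = [3 / 16, 3 / 16, 4 / 16, 3 / 16, 3 / 16]    # classes A..E, as in Figure 1

# Rows: tests 1-6; columns: latent classes A-E (as read from Table 11;
# a few entries are reconstructions and therefore assumptions).
L_prime = [
    [-1.78, 0.72, 0.66, 0.15, 0.03],
    [-1.28, 1.36, 0.26, -0.16, -0.27],
    [-0.67, 1.66, -0.42, -0.25, -0.18],
    [0.04, 0.06, 0.54, 0.71, -1.53],
    [-0.21, -0.18, -0.21, 1.61, -0.94],
    [-0.06, -0.31, -0.48, 1.67, -0.66],
]

# Equation (1): weighted average products of the class profiles.
R = [[sum(v * a * b for v, a, b in zip(V, row_i, row_j))
      for row_j in L_prime] for row_i in L_prime]

between = [R[i][i] for i in range(6)]      # between-classes proportion
within = [1 - r for r in between]          # within-class proportion
mean_within = sum(within) / len(within)
```

With these entries the mean within-class proportion comes out a little above .30, and the reproduced correlation between tests 1 and 2 is close to the observed value less its fourth-factor residual.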

The L’ of Table 11 is to be compared with the fictitious L’ in Table 1, 
and in the context of the arrangement of classes in Figure 1; it is assumed 
here that the Dingman classes are so arranged. Dimension X may here be 
named “Spatial Visualization,” while dimension Y may be called “Verbal 
Comprehension.” The easy and hard tests of both dimensions are in good 
agreement with theoretical expectations in terms of the type of tilt and 
curvature their regression surfaces show. For example, test 1, an easy test 
of Spatial Visualization, has a regression surface which is concave downward, 
with its tilt and its essential curvature being along the Spatial Visualization 
dimension. Test 6, a hard test of Verbal Comprehension, has an upwardly 
concave regression surface, tilted and curved mainly along the Verbal Compre- 
hension dimension, etc. Even the secondary curvature of these four regression 
surfaces—the inwardly curled edges seen earlier in the fictitious example—is 
faithfully preserved in this particular empirical solution. Dingman’s test 2, 
of medium difficulty, is also in good agreement with theory, for its regression 
surface is fairly planar and is tilted along axis X. Test 5, while having a 
regression surface that is more planar than that for the other two Verbal 
Comprehension tests, is nevertheless not too different from test 6 in its 
regression surface. This curved regression for test 5, as though it were more 
of a hard test of dimension Y than one of medium difficulty, might have been 
less pronounced if more attention had been given to the location of point 5 
in the plots that were used to develop the adjustive transformation X. 

In envisioning the application of the present approach to other sets of 
empirical data, some of its indeterminacies should perhaps be kept in mind. 
One of these is its rotational nonuniqueness. Linear transformations other
than the T presented here, when applied to the (possibly adjusted) centroid
or principal-axes factorization of the test intercorrelations, might lead to
equally or more plausible latent-profile matrices. In particular, different
relative frequencies for the latent classes, in which class C becomes relatively 
larger or smaller, would imply different transformations and hence different 
latent-profile matrices and regression surfaces. Furthermore, the exercising 
of different rotational tastes in developing the adjustive transformation X 
(where, for example, more or less attention could be given the tests of medium 
difficulty) would certainly have an effect on the form of the resulting regression 
surfaces. Experience with an analogous unidimensional nonlinear solution 
[5] indicates that, within the restriction that all regression lines be monotonic 
increasing, all alternative rotational solutions leave invariant the relative 
curvatures of these regressions. Similar restraints are undoubtedly operative 
in two dimensions, though their exact nature and extent have not been 
investigated. 

A second indeterminacy of the present approach is the matter of the 
spatial arrangement of the resulting latent classes. The only test of the 
assumption that the Dingman classes are arranged as in Figure 1 is the 
plausibility of the resulting interpretations. This is far removed from psycho- 
metric rigor. There now exists [4] a psychometric solution to this problem 
for the case of one dimension, but the multidimensional case is essentially 
unresolved. 

Some possible extensions of the present approach are perhaps obvious 
to the reader. One of these is to the case of three or more underlying dimen- 
sions. The fictitious example involving three dimensions would require two 
more classes on either side of class C along a third dimension Z. Such an 
example is easily set up and made to yield a 6 × 7 transformation T applicable
to the appropriate empirical data. As the number of dimensions goes
up, however, the sample size and test lengths may also need to increase in 
order to preserve the expected patterns in the last centroids of any empirical 
application. 

Another possible extension would be an attempt to achieve greater 
continuity in the regression surfaces for the present case of two underlying 
dimensions. The five-point regression surfaces we have dealt with here con- 
stitute rather meager representations of surfaces that are perhaps more 
continuous. An improvement, in these terms, would be a solution involving, 
instead of one latent class for each of the five regions in Figure 1, a latent 
class for each of the nine sectors therein. Such a fictitious model is readily 
formed, and it allows for complete between-set independence without the 
sacrifice of cylindrical regression surfaces. However, it requires the extraction 
of six centroids before the residual correlations vanish, and it is doubtful 
whether empirical correlations would show enough fidelity to the model in 
their last centroids so that a standard transformation would be useful. Other 
improvements in continuity could be tried, though these might all tend to 
ask more of empirical data than they are generally able to give. 

Two further and perhaps related extensions concern (i) tests of more
than one dimension and (ii) correlated dimensions. Recall that only pure
tests of either dimension have been involved so far. An easy method for 
generating fictitious composite tests is to form various linear combinations 
of rows in the fictitious L’. For example, a fictitious test that discriminates 
at the low end of dimension X and at the upper end of dimension Y would be 
the sum of tests 1 and 6. The correlations for such tests are readily obtained 
by augmenting L’ by such additional (standardized) rows and applying (1). 
A kind of reversal of this augmenting process would be to remove some of the 
pure tests of dimension X and/or dimension Y. It may be, as some proponents 
of orthogonal rotation in linear factor analysis might argue, that the only 
difference between uncorrelated and correlated dimensions is the presence 
or absence of such pure tests. If so, then a solution for correlated dimensions 
might consist of finding linear combinations of the available tests such that 
the resulting correlations are close to the fictitious pattern of Table 2. This 
and other approaches to the oblique case are perhaps worthy of further 
investigation. The problem of nonlinearly related underlying dimensions 
has not yet been studied seriously.

The problem of measuring the association between two characters,
one quantitative and the other qualitative, is discussed. The formula for 
the large sample standard error of the point biserial correlation coefficient 
under general conditions is derived. The point multiserial correlation coef- 
ficient is introduced and some of its properties are examined. Tests of different 
hypotheses appropriate to these types of problems are formulated. 


We shall deal with the measurement of association between two 
characters, under certain assumptions regarding the interdependence between 
them (e.g., linearity), when one of the characters is given in a quantitative 
form and the other is either directly qualitative or in the form of gradings 
which have an underlying quantitative basis with a unique order. The 
qualitative character divides the universe under study into a number of 
distinct classes. If for these classes a set of scores can be found, then the 
problem reduces to a simple one, namely to find the product moment correla- 
tion coefficient. The problem of finding these scores has been dealt with by 
Thurstone, Likert, and Guttman (see [5]) as well as by Wherry and Taylor 
[13] and by Das Gupta [3]. When the scores cannot be found, the only solu- 
tion is to treat the quantitative character as a qualitative one, i.e., to create a 
number of classes by grouping the values of this character. Usually, the 
contingency chi square is calculated to provide a measure of association; 
the latter depends on the nature of the grouping and is difficult to interpret 
except for the case of complete dependence or complete independence. 

Sometimes it is assumed that the qualitative character has a point- 
distribution, i.e., the whole universe is segregated at a number of distinct 
points with respect to that character. When the number of categories (or 
points) provided by the qualitative character is two, then any system of 
scores can be reduced to the scores 0 and 1. In this case the product moment
correlation coefficient, known as the point biserial correlation coefficient, measures
the linear association between the two characters. When the qualitative
character cannot be quantified (e.g., sex, religion, etc.) then any system of
scores would be meaningless. In this case we really need some measure of
divergence between the categories or sub-universes with respect to the 
quantitative character. But some sort of scoring mechanism may help to 
clarify the relation between the variables. 

To determine the effect of sampling on the point biserial correlation
coefficient (abbreviated as pbs), a number of postulates regarding the joint
distribution of the quantitative and the qualitative characters must be 
specified. It will be shown that the assumptions made by Tate [12] and Lev 
[8] are not suitable in the field of psychometrics where the pbs is used to 
determine the item discrimination values in a test battery. A general model 
is suggested in this paper, and the large sample standard error of the pbs is 
derived following that model. 

When the number of categories of the qualitative character is greater 
than two, then the linear relation between the characters may be measured 
by the product moment correlation coefficient, named point multiserial 
correlation coefficient (pms), provided the set of scores assigned to these 
classes is known. If the postulates made by Tate are generalized to fit this 
case, we can find the distribution of the sample pms, but it will be shown 
that this distribution is of little use. For the problem cited above, the method 
of testing of certain hypotheses is discussed in this paper. 

It is appropriate to mention that Roy has contributed much to the
solution of these types of problems, a topic which he calls the analysis of
categorical data ([9], pp. 113-134).


The General Model 


Consider a universe $Z$ consisting of two sub-universes $Z_0$ and $Z_1$, and
a continuous random variable $X$ defined over $Z$. We shall make the following
assumptions regarding the distribution of $X$.

Assumption 1: $X$ has the probability density function $f(x)$ with finite
mean $m$ and finite nonzero variance $\sigma^2$.

Assumption 2: $X$ has finite moments of at least up to the fourth order.

Assumption 3: The probability density function of $X$ in the sub-universe
$Z_i$ ($i = 0, 1$) is given by $f_i(x)$. The mean and variance of $X$ in $Z_i$ are given
by $m_i$ and $\sigma_i^2$ respectively.

Assumption 4: The probability that $X$ comes from $Z_i$ is $P_i$, where
$P_0 + P_1 = 1$ and $0 < P_1 < 1$.

From the above assumptions, it follows that

$$f(x) = P_0 f_0(x) + P_1 f_1(x). \tag{1}$$

A coefficient defining the divergence between the sub-universes $Z_0$ and
$Z_1$, as measured by the variable $X$, is

$$D = \frac{P_1 P_0 (m_1 - m_0)^2}{\sigma^2}. \tag{2}$$

Instead of considering the coefficient of divergence, we may introduce a
discrete random variable $Y$ defined over $Z$ such that


$$Y = y_0 \ \text{when}\ Y \in Z_0 \quad \text{and} \quad Y = y_1 \ \text{when}\ Y \in Z_1,$$

where

$$\operatorname{Prob}(Y = y_i) = P_i \quad (i = 0, 1). \tag{3}$$


When the partitioning of $Z$ into $Z_0$ and $Z_1$ is on the basis of an attribute
(e.g., sex, religion, etc.) which cannot be quantified, the variable $Y$ is to be
treated as a dummy variate. Here, some function of the product moment
correlation coefficient between $Y$ and $X$ will measure the degree of divergence
between $Z_0$ and $Z_1$.

When the attribute is measurable (e.g., general intelligence, mechanical
aptitude, etc.), it is assumed that there exists an underlying continuum from
which the discrete, two-valued variable $Y$ is obtained by the dichotomization
of the continuum. The product moment correlation coefficient between
$Y$ and $X$ will measure the linear relationship between them. If a unique order
is assumed for $y_0$ and $y_1$ ($y_0 < y_1$) then they can be transformed (without
any loss of generality) to 0 and 1 respectively.

Thus a bivariate distribution of $X$ and $Y$ is defined over $Z$. Let

$$\mu_{ij} = E[(X - E(X))^i (Y - E(Y))^j]. \tag{4}$$

From (1),

$$E(X) = P_1 m_1 + P_0 m_0. \tag{5}$$

Thus, from (4) and (5), with $Y$ scored 0 in $Z_0$ and 1 in $Z_1$,

$$\mu_{ij} = P_1 P_0^{\,j} \int (x - m_1 + P_0 \Delta)^i f_1(x)\,dx
  + P_0 (-P_1)^j \int (x - m_0 - P_1 \Delta)^i f_0(x)\,dx, \tag{6}$$

where

$$\Delta = m_1 - m_0. \tag{7}$$

Using (6),

$$\mu_{11} = P_1 P_0 \Delta; \tag{8}$$

$$\mu_{20} = P_1 \sigma_1^2 + P_0 \sigma_0^2 + P_1 P_0 \Delta^2 = \sigma^2; \tag{9}$$

$$\mu_{02} = P_1 P_0. \tag{10}$$

The product moment correlation coefficient between $X$ and $Y$, called the
point biserial correlation coefficient (pbs), is given by

$$\rho = \frac{P_1 P_0 \Delta}{\sqrt{P_1 P_0}\,\sqrt{P_1 \sigma_1^2 + P_0 \sigma_0^2 + P_1 P_0 \Delta^2}}
      = \frac{\sqrt{P_1 P_0}\,\Delta}{\sigma}. \tag{11}$$

It is seen from (2) and (11) that

$$\rho^2 = D. \tag{12}$$
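
The relation between (2), (9), (11), and (12) can be checked numerically. The following is a minimal sketch only; the function name and the parameter values are hypothetical and chosen purely for illustration.

```python
import math

def pbs_from_mixture(p1, m0, m1, var0, var1):
    """Population pbs (rho) and divergence coefficient D for a two-part universe.

    p1 is the probability of Z_1; m0, m1 and var0, var1 are the means and
    variances of X in Z_0 and Z_1. All argument names are illustrative.
    """
    p0 = 1.0 - p1
    delta = m1 - m0                                        # (7)
    var_x = p1 * var1 + p0 * var0 + p1 * p0 * delta ** 2   # (9)
    rho = math.sqrt(p1 * p0) * delta / math.sqrt(var_x)    # (11)
    d = p1 * p0 * delta ** 2 / var_x                       # (2)
    return rho, d

# Hypothetical parameter values.
rho, d = pbs_from_mixture(p1=0.3, m0=50.0, m1=56.0, var0=25.0, var1=16.0)
print(rho, d, rho ** 2)
```

For any admissible inputs the last two printed values should coincide, which is relation (12).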
Sample Value of pbs

Consider the following sampling schemes.

Scheme 1. Draw a random sample of size $N$ from the above bivariate universe,
the sample values being given by $(X_i, Y_i)$, $i = 1, 2, \ldots, N$. The suffixes
of $X$ and $Y$ are written in such a way (without loss of generality) that

$$Y_i = 0 \ \text{for}\ 1 \le i \le n_0, \qquad Y_i = 1 \ \text{for}\ n_0 < i \le N, \qquad n_0 + n_1 = N.$$

Scheme 2. Draw two random samples of sizes $n_0$ and $n_1$ from the universes
$Z_0$ and $Z_1$ respectively.

Scheme 2 differs from scheme 1 in that $n_0$ and $n_1$ are nonrandom in scheme 2.
However, we shall follow scheme 1 unless specified otherwise.

The following notations are used.

$$\bar X_0 = \frac{1}{n_0}\sum_{i=1}^{n_0} X_i, \qquad \bar X_1 = \frac{1}{n_1}\sum_{i=n_0+1}^{N} X_i, \qquad \bar X = \frac{1}{N}\sum_{i=1}^{N} X_i,$$

$$p_1 = \frac{n_1}{N}, \qquad p_0 = \frac{n_0}{N}, \tag{14}$$

$$S_0^2 = \sum_{i=1}^{n_0} (X_i - \bar X_0)^2, \qquad S_1^2 = \sum_{i=n_0+1}^{N} (X_i - \bar X_1)^2, \qquad S^2 = \sum_{i=1}^{N} (X_i - \bar X)^2.$$

Then the sample pbs is given by

$$r = \frac{\sqrt{N p_1 p_0}\,(\bar X_1 - \bar X_0)}{S}. \tag{15}$$
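
As an illustration of (15), the sketch below computes the sample pbs from raw scores and compares it with the ordinary product moment correlation between $X$ and the 0/1 variable $Y$. The helper names and the small data set are hypothetical.

```python
import math

def sample_pbs(x, y):
    """Sample pbs from (15); y holds the 0/1 group labels."""
    n = len(x)
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    p0, p1 = len(x0) / n, len(x1) / n
    mean = sum(x) / n
    s = math.sqrt(sum((xi - mean) ** 2 for xi in x))  # S: root of the total sum of squares
    return math.sqrt(n * p1 * p0) * (sum(x1) / len(x1) - sum(x0) / len(x0)) / s

def pearson(x, y):
    """Ordinary product moment correlation, used here only as a cross-check."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for six subjects, three in each group.
x = [2.0, 4.0, 5.0, 7.0, 8.0, 9.0]
y = [0, 0, 0, 1, 1, 1]
print(sample_pbs(x, y), pearson(x, y))
```

The two printed values should agree, since (15) is the product moment correlation written out for a two-valued $Y$.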


Large Sample Standard Error of r 


It is known ([1], p. 359) that if $r$ is the product moment correlation
coefficient between two random variables $X$ and $Y$ derived from a random
sample of size $N$, then

$$E(r) = \rho + O\!\left(\frac{1}{N}\right), \tag{16}$$

$$\operatorname{Var}(r) = \frac{\rho^2}{4N}\left[\frac{\mu_{40}}{\mu_{20}^2} + \frac{\mu_{04}}{\mu_{02}^2} + \frac{2\mu_{22}}{\mu_{20}\mu_{02}} + \frac{4\mu_{22}}{\mu_{11}^2} - \frac{4\mu_{31}}{\mu_{11}\mu_{20}} - \frac{4\mu_{13}}{\mu_{11}\mu_{02}}\right] + O\!\left(\frac{1}{N^{3/2}}\right), \tag{17}$$

where $E(r)$ and $\operatorname{Var}(r)$ denote the mean and variance of $r$ respectively,

$$\rho = \text{population correlation coefficient}, \qquad \mu_{ij} = E[(X - E(X))^i (Y - E(Y))^j],$$

and $O(1/N^k)$ indicates the remainder of the series, which, if multiplied by
$N^k$, tends to a constant as $N$ approaches infinity. Applying formula (6),

$$\mu_{40} = (P_1 u_4 + P_0 v_4) + 4P_1P_0\Delta(u_3 - v_3) + 6\Delta^2 P_1P_0(P_0 u_2 + P_1 v_2) + P_1P_0\Delta^4(1 - 3P_1P_0) = \beta_2\sigma^4, \tag{18}$$

$$\mu_{04} = P_1P_0(1 - 3P_1P_0); \tag{19}$$

$$\mu_{22} = P_1P_0(P_0 u_2 + P_1 v_2) + \Delta^2 P_0P_1(1 - 3P_0P_1); \tag{20}$$

$$\mu_{13} = \Delta P_1P_0(1 - 3P_1P_0), \tag{21}$$

$$\mu_{31} = P_1P_0(u_3 - v_3) + 3P_1P_0\Delta(P_0 u_2 + P_1 v_2) + P_1P_0\Delta^3(1 - 3P_1P_0), \tag{22}$$

where

$$u_i = \int (x - m_1)^i f_1(x)\,dx, \tag{23}$$

$$v_i = \int (x - m_0)^i f_0(x)\,dx, \tag{24}$$

and $\beta_2$ = kurtosis of the distribution of $X$.

We have $u_2 = \sigma_1^2$ and $v_2 = \sigma_0^2$. Using (17), (8) through (10), (18) through
(22), the variance of $r$ is obtained, after some algebraic manipulations, as

$$\operatorname{Var}(r) = \frac{1}{4N}\left[\rho^2\left\{\beta_2 + \frac{1 - 3P_1P_0}{P_1P_0}(1 - 2\rho^2)\right\} + 2(1 - \rho^2)(2 - 5\rho^2)\left(\frac{P_0 + P_1K}{P_1 + P_0K}\right) - \frac{4\rho}{\sigma^3}(u_3 - v_3)\sqrt{P_1P_0}\right] + O\!\left(\frac{1}{N^{3/2}}\right), \tag{25}$$

where

$$K = \sigma_0^2/\sigma_1^2, \tag{26}$$

and $r$ = sample pbs derived from a random sample of size $N$. An estimate
of $\operatorname{Var}(r)$ can be obtained by replacing the parameters involved in (25) by
their corresponding sample estimates.
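It is to be noted from (18) that the set of parameters involved in the
joint distribution must satisfy the relation

$$\beta_2\sigma^4 - 4P_1P_0\Delta(u_3 - v_3) - 6\Delta^2 P_1P_0(P_0 u_2 + P_1 v_2) - P_1P_0\Delta^4(1 - 3P_1P_0) > 0$$

or,

$$\beta_2 - 4\sqrt{P_1P_0}\,\frac{\rho}{\sigma^3}(u_3 - v_3) - 6\rho^2\left(\frac{P_0 + P_1K}{P_1 + P_0K}\right)(1 - \rho^2) - \rho^4\,\frac{1 - 3P_1P_0}{P_1P_0} > 0. \tag{27}$$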


Asymptotic Variance of r for Other Models 
The following cases, which are subcases of the general model described 
earlier, are now considered. 
Case 1: Normal model 


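Assume $X$ normally distributed with mean $m$ and variance $\sigma^2$. Here
$\beta_2 = 3$ and

$$\mu_{30} = (P_1 u_3 + P_0 v_3) + 3P_1P_0(\sigma_1^2 - \sigma_0^2)\Delta + P_1P_0\Delta^3(P_0 - P_1) = 0,$$

and (25) reduces to

$$\operatorname{Var}(r) = \frac{1}{4N}\left[\rho^2\left\{3 + \frac{(1 - 2\rho^2)(1 - 3P_1P_0)}{P_1P_0}\right\} + 2(1 - \rho^2)(2 - 5\rho^2)\left(\frac{P_0 + P_1K}{P_1 + P_0K}\right) - \frac{4\rho}{\sigma^3}(u_3 - v_3)\sqrt{P_1P_0}\right] + O\!\left(\frac{1}{N^{3/2}}\right). \tag{28}$$

Subcase (1a): $K = 1$, $u_3 = v_3$.
From (28),

$$\operatorname{Var}(r) = \frac{1}{4N}\left[\frac{\rho^2(1 - 2\rho^2)}{P_1P_0} + 2(2 - 7\rho^2 + 8\rho^4)\right] + O\!\left(\frac{1}{N^{3/2}}\right). \tag{29}$$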





The relation (27) becomes

$$\rho^4\left(9 - \frac{1}{P_1P_0}\right) - 6\rho^2 + 3 > 0. \tag{30}$$

For a given $\rho$, the range of $P_1$ for which the above inequality will be
satisfied is given in Table 1. In general, (30) will always be satisfied for
$.22 < P_1 < .78$ (approximately). $\operatorname{Var}(r)$, as given in (29), is calculated for
admissible sets of parameters and presented in Table 2.

S. DAS GUPTA

TABLE 1
Admissible Range of $P_1$ for $\rho$ from 0.1 to 1.0

$\rho$     Lower limit of $P_1$     Upper limit of $P_1$
0.1        0.00003                  0.99997
0.2        0.0006                   0.9994
0.3        0.0032                   0.9968
0.4        0.0114                   0.9886
0.5        0.0313                   0.9687
0.6        0.0694                   0.9306
0.7        0.1283                   0.8717
0.8        0.1742                   0.8258
0.9        0.2037                   0.7963
1.0        0.2115                   0.7885
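
The limits in Table 1 and the entries of Table 2 can be regenerated from (30) and (29). The sketch below is illustrative only: the function names are hypothetical, and (30) is first rearranged to $P_1P_0 > \rho^4/(9\rho^4 - 6\rho^2 + 3)$, which is valid because the denominator is positive for every $\rho$.

```python
import math

def admissible_p1_range(rho):
    """Limits of P_1 satisfying (30) for a given rho, with 0 < rho <= 1."""
    # (30) is equivalent to P_1 * P_0 > q_min.
    q_min = rho ** 4 / (9 * rho ** 4 - 6 * rho ** 2 + 3)
    half_width = math.sqrt(1 - 4 * q_min) / 2
    return 0.5 - half_width, 0.5 + half_width

def n_var_r(rho, p1):
    """N * Var(r) from (29), with the remainder term dropped."""
    q = p1 * (1 - p1)
    return (rho ** 2 * (1 - 2 * rho ** 2) / q
            + 2 * (2 - 7 * rho ** 2 + 8 * rho ** 4)) / 4

for rho in (0.1, 0.5, 1.0):
    print(rho, admissible_p1_range(rho))
print(n_var_r(0.5, 0.5), n_var_r(0.9, 0.3))
```

Under these assumptions the computed values agree with the tabled ones to within rounding in the third or fourth decimal place.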
Subcase (1b): $u_3 = v_3$, $P_1 = P_0$.
From (28),

$$\operatorname{Var}(r) = \frac{1}{2N}(2 - 5\rho^2 + 4\rho^4) + O\!\left(\frac{1}{N^{3/2}}\right). \tag{31}$$

TABLE 2
$N[\operatorname{Var}(r)]$ Obtained from (29) for Different Admissible Values of $\rho$ and $P_1$
(Division by $N$ yields an estimate of $\operatorname{Var}(r)$.)

$P_1$ or $P_0$   $\rho$ = 0.1   0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9
0.05             1.0169         1.0601   1.1058   1.1150   1.0329
0.10             0.9926         0.9686   0.9224   0.8446   0.7222   0.5384
0.15             0.9846         0.9386   0.8621   0.7557   0.6201   0.4560   0.2646
0.20             0.9807         0.9239   0.8270   0.7124   0.5703   0.4159   0.2607   0.1184
0.25             0.9785         0.9155   0.8158   0.6875   0.5417   0.3928   0.2585   0.1595   0.1198
0.30             0.9771         0.9102   0.8053   0.6719   0.5238   0.3784   0.2571   0.1851   0.1915
0.35             0.9762         0.9068   0.7985   0.6620   0.5124   0.3692   0.2562   0.2015   0.2375
0.40             0.9756         0.9047   0.7943   0.6557   0.5052   0.3634   0.2556   0.2117   0.2663
0.45             0.9753         0.9036   0.7919   0.6523   0.5013   0.3602   0.2553   0.2174   0.2821
0.50             0.9752         0.9032   0.7912   0.6512   0.5000   0.3592   0.2552   0.2192   0.2872

Case 2: Tate’s model

Assume $X$ normally distributed in $Z_0$ and $Z_1$ separately with $\sigma_1^2 = \sigma_0^2$.
Here

$$K = 1, \qquad u_3 = v_3 = 0, \qquad u_4 = 3\sigma_1^4 = 3\sigma_0^4 = v_4.$$

Formula (25) reduces to

$$\operatorname{Var}(r) = \frac{[\rho^2 + P_1P_0(4 - 6\rho^2)](1 - \rho^2)^2}{4NP_1P_0} + O\!\left(\frac{1}{N^{3/2}}\right). \tag{32}$$





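Lev [8] has considered a model similar to Tate’s except that in Lev’s model
the $Y_i$ values are assumed to be nonstochastic, i.e., $n_0$ and $n_1$ are nonrandom.
In this case sampling scheme 2 is used. The probability density function
of $X_i$, as given in Lev’s model, is

$$\frac{1}{\sigma\sqrt{2\pi(1 - \rho^2)}}\exp\left[-\frac{(x_i - m - \rho\sigma U_i)^2}{2\sigma^2(1 - \rho^2)}\right], \tag{33}$$

where

$$U_i = \frac{Y_i - \bar Y}{\sqrt{\operatorname{Var}(Y)}} = \begin{cases} -\sqrt{n_1/n_0} & \text{for } i = 1, 2, \ldots, n_0; \\ \phantom{-}\sqrt{n_0/n_1} & \text{for } i = n_0 + 1, \ldots, N, \end{cases} \tag{34}$$

and $\rho$ is the point biserial correlation coefficient given by

$$\rho = \frac{\sqrt{n_1 n_0}}{N}\cdot\frac{m_1 - m_0}{\sigma}, \tag{35}$$

$m_1$, $m_0$, and $\sigma^2$ being the expected values of $\bar X_1$, $\bar X_0$, and $\sum_{i=1}^{N}(X_i - m)^2/N$
respectively. The formulas for $\rho$ defined by (11) and (35) are not the same.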

Lev [8] has shown that the distribution of $\sqrt{N - 2}\,r/\sqrt{1 - r^2}$ is the
noncentral $t$ distribution with degrees of freedom $N - 2$ and parameter of
noncentrality $\sqrt{N}\rho/\sqrt{1 - \rho^2}$, $r$ being given by (15). From the above
results and from a theorem given by Cramér ([1], p. 366), the following
theorem is obtained.

THEOREM. For large $N$, $\sqrt{N}(r - \rho)$ is normally distributed with mean
0 and variance

$$\frac{1}{4}\left[\rho^2\left\{\beta_2 + \frac{1 - 3P_1P_0}{P_1P_0}(1 - 2\rho^2)\right\} + 2(1 - \rho^2)(2 - 5\rho^2)\left(\frac{P_0 + P_1K}{P_1 + P_0K}\right) - \frac{4\rho}{\sigma^3}(u_3 - v_3)\sqrt{P_1P_0}\right].$$





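Case 3: Generalization of Tate’s model

Assume $X$ normally distributed in $Z_0$ and $Z_1$ separately. In this case

$$u_3 = v_3 = 0, \qquad u_4 = 3\sigma_1^4, \qquad v_4 = 3\sigma_0^4.$$

Formula (25) reduces to

$$\operatorname{Var}(r) = \frac{(1 - \rho^2)^2}{4N}\left[\frac{\rho^2 + P_1P_0(4 - 6\rho^2)}{P_1P_0} + \frac{(P_0 - P_1)(1 - K)}{P_1 + P_0K}(4 - 6\rho^2) + \frac{3\rho^2 P_1P_0(1 - K)^2}{(P_1 + P_0K)^2}\right] + O\!\left(\frac{1}{N^{3/2}}\right). \tag{36}$$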


Problems Associated With the Above Models 


Testing the Hypothesis K = 1 for the Generalization of Tate’s Model 


When $n_1$ is fixed, the critical region $w(n_1, \alpha)$ of size $\alpha$ for testing the
hypothesis $K = 1$ can be derived from the usual $F$-test where

$$F = \frac{S_1^2/(n_1 - 1)}{S_0^2/(n_0 - 1)}.$$

The critical region for the unconditional test is obtained by pooling the
critical regions $w(n_1, \alpha)$ for various values of $n_1$ ($1 < n_1 < N - 1$). We fail
to make decisions when $n_1 = 0$, $1$, $N - 1$, or $N$. The size of this critical
region will be less than $\alpha$. This test provides the valid similar region and
one-sided UMP test for the set of alternative hypotheses $K < 1$ or $K > 1$.


Testing the Hypothesis p = 0 
We reject the cases $n_1 = 0$ and $n_1 = N$ for which $r$ would be meaningless.
For Tate’s model, the probability density of $T = \sqrt{N - 2}\,r/\sqrt{1 - r^2}$ is

$$p(t; N, P_1, \rho) = \frac{\sum_{n_1=1}^{N-1}\binom{N}{n_1}P_1^{n_1}P_0^{n_0}\,f(t; n_1, N, P_1, \rho)}{1 - P_1^N - P_0^N}, \tag{39}$$

where $f$ is the probability density function of the noncentral $t$ distribution with
degrees of freedom $N - 2$ and noncentrality parameter

$$\rho\sqrt{\frac{n_1 n_0}{N P_1 P_0 (1 - \rho^2)}}.$$

When $\rho = 0$, $T$ follows Student’s $t$ distribution with $N - 2$ degrees of freedom.

The hypothesis $\rho = 0$ will be rejected when $|T| > k$, where $k$ is determined
from the size of the critical region. The power function of this test is

$$\beta(\rho, P_1) = 1 - \frac{1}{1 - P_1^N - P_0^N}\sum_{n_1=1}^{N-1}\binom{N}{n_1}P_1^{n_1}P_0^{n_0}\int_{-k}^{k} f(t)\,dt. \tag{40}$$

For the generalization of Tate’s model, the hypothesis $\rho = 0$, equivalent
to the hypothesis $m_1 = m_0$, can be tested by the Behrens-Fisher test, the
Welch test ([7], p. 112), or a test proposed by James [6]. From this conditional
test, the unconditional test can be derived as in the previous problem.

For large samples, the hypothesis $\rho = 0$ can be tested by using the fact
that

$$r\sqrt{\frac{N(P_1 + P_0K)}{P_0 + P_1K}}$$

is a unit normal variate.

For testing the hypothesis $m_1 = m_0$, a nonparametric test such as the
Wilcoxon-Mann-Whitney test ([10], p. 116) can be used.


Confidence Interval for p 


The confidence interval for p can be derived by using the conditional 
distribution of r or by using the theorem presented for large samples. 


Variance-Stabilizing Transformation 


A function $f(r)$ of $r$ can be found such that the large sample variance
of $f(r)$ will not involve the population pbs $\rho$ and such that $f(r)$ will approach
the normal distribution more rapidly. The function $f(r)$ may involve other parameters,
e.g., $K$, $P_1$, etc. If the problem concerning $\rho$ is of main interest, then the
parameters may be replaced by their large sample estimates.

We shall consider only (36) since for other formulas the form of $f(r)$
will be too complicated. To the order of approximation indicated before,
(36) may be written as $V(r) = (1/N)[A + B\rho^2](1 - \rho^2)^2$, where

$$A = \frac{P_0 + P_1K}{P_1 + P_0K}, \tag{41}$$

$$B = \frac{1}{4P_1P_0} - \frac{3}{2}\,\frac{P_0 + P_1K}{P_1 + P_0K} + \frac{3}{4}\,\frac{(1 - K)^2 P_1P_0}{(P_1 + P_0K)^2}. \tag{42}$$

Then the transforming function is obtained as

$$f(r) = \frac{1}{\sqrt{A + B}}\tanh^{-1}\left(\frac{r\sqrt{A + B}}{\sqrt{A + Br^2}}\right), \tag{43}$$

$$\operatorname{Var}[f(r)] = \frac{1}{N}. \tag{44}$$

Then $f(r)$ will be normally distributed with mean $f(\rho)$ and variance $1/N$ for
large samples. In the particular case when $K = 1$, $P_1 = P_0$, we get Tate’s
result [12]. $f(r)$ may be obtained, for given $r$, by using Table VB in Fisher
([4], p. 210).
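
In place of a table lookup, the transformation (43) can be evaluated directly. The following is a minimal sketch under the assumptions that $A + B > 0$ and $|r| < 1$; the function names are hypothetical, and the parameter values are simply the estimates that appear in the illustrative example.

```python
import math

def stabilizing_constants(p1, k):
    """A and B of (41) and (42) for the generalized Tate model."""
    p0 = 1.0 - p1
    ratio = (p0 + p1 * k) / (p1 + p0 * k)
    a = ratio
    b = (1 / (4 * p1 * p0) - 1.5 * ratio
         + 0.75 * (1 - k) ** 2 * p1 * p0 / (p1 + p0 * k) ** 2)
    return a, b

def stabilized(r, p1, k):
    """f(r) of (43); assumes A + B > 0 and |r| < 1."""
    a, b = stabilizing_constants(p1, k)
    s = math.sqrt(a + b)
    return math.atanh(r * s / math.sqrt(a + b * r * r)) / s

print(stabilizing_constants(0.5, 1.0))      # the case K = 1, P_1 = P_0
print(stabilized(0.3891, 15 / 41, 1.2103))
```

A quick check on the sketch is that the slope of `stabilized` at $\rho$, multiplied by $(1 - \rho^2)\sqrt{A + B\rho^2}$, should be close to 1, which is the condition that makes the large sample variance of $f(r)$ equal to $1/N$.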


Illustrative Example 


In a sample of 41 subjects from Vellore Medical College, India, we measure
the association between sex of student and the first-term Medical College
Score (First M.B.B.S.) by computing the point biserial correlation
coefficient $r = .3891$.

Estimate of $K$ = 1.2103.
Estimate of $\operatorname{Var}(r)$ = .01376 from (25),
                              = .01578 from (36),
                              = .01644 from (32).

TABLE 3
M.B.B.S. Statistics Categorized by Sex

Statistic        Girl       Boy        Combined
Sample size      15         26         41
Mean score       61.2513    57.0527    58.5888
Variance         20.2242    24.4793    27.0124
Third moment     31.3225    -16.4175   6.7826
$\beta_1$        .1186      .0184      .0023
$\beta_2$        1.7577     2.0278     2.4340
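
The three variance estimates can be recomputed from the summary statistics of Table 3. The sketch below is illustrative rather than a record of the original computation: it assumes that girls form $Z_1$ and boys form $Z_0$ (the assignment that reproduces the reported estimate of $K$), that sample values are substituted directly for the parameters, and that the variable names are free choices.

```python
import math

# Summary statistics from Table 3 (assumed coding: girls = Z_1, boys = Z_0).
n1, mean1, var1, third1 = 15, 61.2513, 20.2242, 31.3225
n0, mean0, var0, third0 = 26, 57.0527, 24.4793, -16.4175
var_x, beta2 = 27.0124, 2.4340

n = n0 + n1
p1, p0 = n1 / n, n0 / n
q = p1 * p0
r = math.sqrt(q) * (mean1 - mean0) / math.sqrt(var_x)  # (11) with sample values
k = var0 / var1                                         # (26)
ratio = (p0 + p1 * k) / (p1 + p0 * k)
r2 = r * r

# (25): general model.
var_general = (r2 * (beta2 + (1 - 3 * q) * (1 - 2 * r2) / q)
               + 2 * (1 - r2) * (2 - 5 * r2) * ratio
               - 4 * r * (third1 - third0) * math.sqrt(q) / var_x ** 1.5) / (4 * n)

# (36): generalization of Tate's model.
var_gen_tate = (1 - r2) ** 2 * (r2 / q + (4 - 6 * r2) * ratio
                + 3 * r2 * q * (1 - k) ** 2 / (p1 + p0 * k) ** 2) / (4 * n)

# (32): Tate's model.
var_tate = (r2 + q * (4 - 6 * r2)) * (1 - r2) ** 2 / (4 * n * q)

print(r, k, var_general, var_gen_tate, var_tate)
```

Under these assumptions the results agree with the reported $r$, $K$, and variance estimates to within rounding in the last decimal place shown.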





Appropriateness of Different Models and Applications to Psychometrics 


“The fundamental problem of psychological testing is that of measuring 
some quantity which has a name, for example, general intelligence or 
mechanical aptitude. A measurable criterion is selected to represent the 
quantity under consideration. In the absence of any external criterion the 
total test score is sometimes used. The technique consists firstly in summoning 
for consideration all possible questions, or items, as they are called, which 
could have any bearing on the quantity to be measured, and which can be
answered quickly and unambiguously by the subject who is being tested.
The item is then scored 1 if the degree of association with the quantity is
positive, and 0 if it is not” ([11], pp. iv-v).

Following the previous notation, Y will be the item score and X is some 
measurement of the ability in question. The pbs for each item, known as 
item discrimination value, is used for the validity study when X is the ex- 
ternal criterion, or for the study of internal consistency when X is the total 
test score. The formula for the standard error of the coefficient is needed (i) 
to compare different discrimination values, (ii) to test some hypotheses con- 
cerning this coefficient, and (iii) to get a confidence interval of the discrimi- 
nation value. 

For item analysis, the assumption that $X$ is normally distributed in $Z$
is reasonable. However, the assumption that $X$ is normally distributed in
$Z_0$ and $Z_1$ separately is not reasonable, the distribution of $X$ being independent
of different items. This statement is also approximately true when $X$ is the
total test score for a test comprised of a large number of items.

If Tate’s model or the generalized Tate’s model is used, then from (6),

$$\mu_{i0} = P_1\int (t\sigma_1 + P_0\Delta)^i\,\frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt + P_0\int (t\sigma_0 - P_1\Delta)^i\,\frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt = f_i(P_1, \sigma_1, \sigma_0, \Delta). \tag{45}$$

Thus $\mu_{i0}$, the $i$th central moment of $X$, is a function of $P_1$, $\sigma_1$, $\sigma_0$, and $\Delta$
corresponding to any given item. Given the distribution of the criterion
variable, a set of equations in four unknowns $P_1$, $\sigma_1$, $\sigma_0$, and $\Delta$ can be obtained
by equating $f_i$ to the $i$th central moment of $X$ ($i = 1, 2, 3, \ldots$). But the parameters
$P_1$, $\sigma_1$, $\sigma_0$, and $\Delta$, corresponding to all the items included in a test
battery, need not (and generally will not) lie in the solution space of those 
equations. Hence, the formulas for the standard error of pbs, derived from 
these models, cannot be applied in an item analysis situation. This con- 
clusion is also true for Lev’s model. 

It is true that the computations required to use these formulas are too 
extensive for a test author to use on each of the items of an experimental 
test. But these formulas are of theoretical interest and, in addition, with 
the help of suitable tables they could be applied in practice. For item anal- 
ysis, formula (28) under the assumption that X is normal in Z would be 
most suitable from both theoretical and practical viewpoints. The variance 
stabilizing transformation for the point biserial correlation function would 
yield a scaled index of discrimination. Instead of performing this parametric 
test or large sample test, one may analyze this type of data by a rank biserial 
correlation coefficient [2]. 

The extension of this mode of approach to types of situations other than 
psychometrics will in most cases be evident. For example, the assumption 
in Tate’s model, or the generalized Tate’s model, may be reasonable if Z 
represented sex, religion, etc. Unlike biserial correlation [11], pbs will always 
lie within the closed interval [—1, 1] and the sample pbs will be the maximum 
likelihood estimate for Tate’s model or the generalized Tate’s model. 


Point Multiserial Correlation Coefficient 

Definition 

Let $Y$ be a discrete random variable taking values $y_i$ ($i = 1, 2, \ldots, l$)
with probability $P_i$, and $X$ be a continuous random variable such that when
$Y = y_i$ the conditional mean and variance of $X$ are $m_i$ and $\sigma_i^2$, respectively.

Then the product moment correlation coefficient between $X$ and $Y$,
termed the point multiserial correlation coefficient (designated shortly as
pms), is given by

$$\rho = \frac{\sum_{i=1}^{l} P_i m_i (y_i - \bar y)}{\left[\sum_{i=1}^{l} P_i (y_i - \bar y)^2 \left\{\sum_{i=1}^{l} P_i \sigma_i^2 + \sum_{i=1}^{l} P_i (m_i - m)^2\right\}\right]^{1/2}}, \tag{46}$$

where $\bar y = \sum_i P_i y_i$ and $m = \sum_i P_i m_i$; the numerator may equivalently be
written as $\sum\sum P_i P_j (m_i - m_j)(y_i - y_j)$, the sum being taken over all pairs with $i < j$.

Properties of $\rho$

(i) $m_1 = m_2 = \cdots = m_l$ is a sufficient condition to make $\rho = 0$ but is
not a necessary condition.

(ii) When $l = 2$, $\rho$ (except for its sign) is invariant for a linear
transformation of $y_1$ and $y_2$. Thus, $y_1$ and $y_2$ can be replaced by 1 and 0.

(iii) When $l = 3$ and if $y_1 + y_2 + y_3 = 0$, where $y_1 > y_2 > y_3$, then $\rho$
will be unchanged if $y_1$, $y_2$, and $y_3$ are replaced by 1, 0, and $-1$, respectively.

(iv) For the variations of the values of the $y_i$, $\rho$ will be maximum when
$y_i = \alpha m_i + \beta$ ($i = 1, \ldots, l$), where $\alpha$ is a preassigned positive constant.

(v) When the values of $y_i$ are restricted by $y_1 < y_2 < \cdots < y_l$, the $y_i$
values can be obtained by maximizing $\rho$ subject to the above restriction; this
principle of finding $y$ values is suggested by the present author [3].

The square of the optimum value of pms when $y_i = \alpha m_i + \beta$ (vide
property iv) is equal to the square of the multiserial eta ($\eta$) coefficient defined
by Wherry and Taylor [13]. The type of weighting indicated by property
(iv) forces linearity rather than assumes it. While the multiserial eta ex-
pends one degree of freedom with the weighting of the categories (of Y), 
this is known and correctable; whereas the vague partial loss of degrees of 
freedom due to the ordering of categories in the point multiserial correlation 
is not correctable [13]. Wherry advocates the use of multiserial eta, but in 
using it one will lose all the information contained in the nominal or ordinal 
scale values of Y (if the scale values are found). 
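
Property (iv) and its relation to the multiserial eta can be illustrated numerically. The sketch below uses the product moment definition of pms directly; the function name and the three-category parameter values are hypothetical.

```python
import math

def pms(p, m, var, y):
    """Population pms for category scores y, together with eta squared.

    p, m, var hold the category probabilities, conditional means, and
    conditional variances of X; all names are illustrative.
    """
    ybar = sum(pi * yi for pi, yi in zip(p, y))
    mbar = sum(pi * mi for pi, mi in zip(p, m))
    cov = sum(pi * mi * (yi - ybar) for pi, mi, yi in zip(p, m, y))
    var_y = sum(pi * (yi - ybar) ** 2 for pi, yi in zip(p, y))
    between = sum(pi * (mi - mbar) ** 2 for pi, mi in zip(p, m))
    within = sum(pi * vi for pi, vi in zip(p, var))
    return cov / math.sqrt(var_y * (within + between)), between / (within + between)

# Hypothetical three-category universe.
p = [0.2, 0.5, 0.3]
m = [10.0, 14.0, 15.0]
var = [4.0, 9.0, 6.0]

rho_equal, eta2 = pms(p, m, var, [1.0, 2.0, 3.0])           # equally spaced scores
rho_opt, _ = pms(p, m, var, [2.0 * mi + 1.0 for mi in m])   # y_i = alpha * m_i + beta
print(rho_equal ** 2, rho_opt ** 2, eta2)
```

The squared pms for the scores $y_i = \alpha m_i + \beta$ should equal eta squared, while any other scoring, such as the equally spaced one, cannot exceed it.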


Sampling Distribution 

While deriving the sampling distribution of pms, the conditional
distribution of $X$ in different categories of $Y$ must be specified. Assume that the
conditional distribution of $X$ when $Y = y_i$ is a normal distribution with
mean $m_i$ and variance $\sigma^2$. Also assume that the $y_i$ values are known. A random
sample of size $N$ is drawn from this bivariate universe. Let

$n_i$ = frequency in the $i$th category of $Y$,

$x_{ij}$ = value of $x$ of the $j$th individual in the $i$th category of $Y$,

$$\bar x_i = \sum_{j=1}^{n_i} x_{ij}/n_i, \qquad \bar x = \sum_{i=1}^{l} n_i \bar x_i/N, \qquad \bar y = \sum_{i=1}^{l} n_i y_i/N, \qquad N = \sum_{i=1}^{l} n_i.$$

Then the sample value of pms is

$$r = \frac{\sum_{i=1}^{l} n_i \bar x_i (y_i - \bar y)}{\left[\sum_{i=1}^{l} n_i (y_i - \bar y)^2 \left\{\sum_{i=1}^{l} n_i (\bar x_i - \bar x)^2 + \sum_{i=1}^{l}\sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2\right\}\right]^{1/2}}. \tag{47}$$
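
Formula (47) can be checked against the ordinary product moment correlation computed on the individual observations, each paired with the score of its category. The sketch below is illustrative; the function names, the category scores, and the data are hypothetical.

```python
import math

def sample_pms(groups, scores):
    """Sample pms from (47); groups[i] lists the x values in category i."""
    n_i = [len(g) for g in groups]
    n = sum(n_i)
    xbar_i = [sum(g) / len(g) for g in groups]
    xbar = sum(ni * xi for ni, xi in zip(n_i, xbar_i)) / n
    ybar = sum(ni * yi for ni, yi in zip(n_i, scores)) / n
    num = sum(ni * xi * (yi - ybar) for ni, xi, yi in zip(n_i, xbar_i, scores))
    ss_y = sum(ni * (yi - ybar) ** 2 for ni, yi in zip(n_i, scores))
    between = sum(ni * (xi - xbar) ** 2 for ni, xi in zip(n_i, xbar_i))
    within = sum((x - xi) ** 2 for g, xi in zip(groups, xbar_i) for x in g)
    return num / math.sqrt(ss_y * (between + within))

def pearson(x, y):
    """Ordinary product moment correlation, used here only as a cross-check."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical data: three categories scored 1, 2, and 4.
groups = [[3.0, 5.0, 4.0], [6.0, 8.0], [9.0, 7.0, 10.0, 8.0]]
scores = [1.0, 2.0, 4.0]
flat_x = [x for g in groups for x in g]
flat_y = [s for g, s in zip(groups, scores) for _ in g]
print(sample_pms(groups, scores), pearson(flat_x, flat_y))
```

The two printed values should coincide, because the bracketed sum in the denominator of (47) is the total sum of squares of $x$ split into its between-category and within-category parts.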


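The conditional distribution of $r$ will be derived first, i.e., $n_1, n_2, \ldots, n_l$
are fixed for given $N$. It is assumed that all $n_i$ are nonzero integers. If some
of the $n_i$ are zero, those categories are omitted and the value of $l$ is altered
thereby.

$$\frac{1}{\sigma^2}\sum_{i=1}^{l}\sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2$$

is distributed, independently of $\bar x_i$, as $\chi^2$ with $N - l$ degrees of freedom.
Define

$$U_i = \sqrt{n_i}\,\bar x_i.$$

Then $U_1, U_2, \ldots, U_l$ are independently normally distributed with means
$\sqrt{n_1}\,m_1, \ldots, \sqrt{n_l}\,m_l$ respectively and with common variance $\sigma^2$.
Transform $U_i$ by the orthogonal transformation

$$t_j = \sum_{i=1}^{l} c_{ji} U_i \quad (j = 1, \ldots, l). \tag{48}$$

In particular,

$$t_1 = \frac{\sum_{i=1}^{l} n_i \bar x_i}{\sqrt{N}} = \frac{\sum_{i=1}^{l} \sqrt{n_i}\,U_i}{\sqrt{N}},$$

$$t_2 = \frac{\sum_{i=1}^{l} \sqrt{n_i}\,(y_i - \bar y)\,U_i}{\left[\sum_{i=1}^{l} n_i (y_i - \bar y)^2\right]^{1/2}}.$$

Let

$$E(t_j) = \epsilon_j, \qquad \operatorname{Var}(t_j) = \sigma^2.$$

Then

$$\epsilon_1 = \frac{\sum_{i=1}^{l} n_i m_i}{\sqrt{N}} = \sqrt{N}\,m \ \text{(say)},$$

$$\epsilon_2 = \frac{\sum_{i=1}^{l} n_i m_i (y_i - \bar y)}{\left[\sum_{i=1}^{l} n_i (y_i - \bar y)^2\right]^{1/2}} = \theta \ \text{(say)},$$

and

$$\epsilon_j = \sum_{i=1}^{l} c_{ji}\sqrt{n_i}\,m_i,$$

where

$$\sum_{i=1}^{l} c_{ji}\sqrt{n_i} = 0, \qquad \sum_{i=1}^{l} c_{ji}\sqrt{n_i}\,(y_i - \bar y) = 0 \quad (\text{for } j > 2).$$

From the orthogonal relationship (48),

$$\sum_{i=1}^{l} n_i m_i^2 = \sum_{j=1}^{l} \epsilon_j^2 = N m^2 + \theta^2 + \sum_{j=3}^{l} \epsilon_j^2,$$

or

$$\sum_{j=3}^{l} \epsilon_j^2 = \sum_{i=1}^{l} n_i (m_i - m)^2 - \theta^2 = \Delta^2 - \theta^2 = \lambda^2 \ \text{(say)}.$$

Again

$$\sum_{i=1}^{l} n_i (\bar x_i - \bar x)^2 = \sum_{i=1}^{l} n_i \bar x_i^2 - N\bar x^2 = \sum_{j=2}^{l} t_j^2.$$

From the above results,

$$r = \frac{t_2}{\left[\sum_{j=2}^{l} t_j^2 + \sum_{i=1}^{l}\sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2\right]^{1/2}}$$

or

$$\frac{r}{\sqrt{1 - r^2}} = \frac{t_2/\sigma}{\left[\sum_{j=3}^{l} t_j^2/\sigma^2 + \sum_{i=1}^{l}\sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2/\sigma^2\right]^{1/2}} = t \ \text{(say)}. \tag{49}$$

$\left(\sum_{j=3}^{l} t_j^2\right)/\sigma^2$ is distributed as noncentral $\chi^2$ with degrees of freedom $l - 2$
and noncentrality parameter $\lambda^2/\sigma^2$. Hence

$$\frac{1}{\sigma^2}\sum_{j=3}^{l} t_j^2 + \frac{1}{\sigma^2}\sum_{i=1}^{l}\sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2$$

is distributed as noncentral $\chi^2$ with degrees of freedom $N - 2$ and
noncentrality parameter $\lambda^2/\sigma^2$.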
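Thus the distribution of $t$ is obtained as

$$p(t)\,dt = \frac{e^{-(\lambda^2 + \theta^2)/2\sigma^2}}{\sqrt{\pi}}\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{\Gamma\!\left(\frac{N - 1 + i}{2} + j\right)}{\Gamma\!\left(\frac{N - 2}{2} + j\right)}\cdot\frac{\left(\lambda^2/2\sigma^2\right)^j}{j!}\cdot\frac{\left(\sqrt{2}\,\theta t/\sigma\right)^i}{i!}\cdot\frac{dt}{(1 + t^2)^{(N - 1 + i)/2 + j}}. \tag{50}$$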


Hence it is found that the distribution of $t$, when the $n_i$ are fixed, involves
$\lambda^2/\sigma^2$ and $\theta/\sigma$. Even when $\theta = 0$ (i.e., $\rho = 0$), the distribution involves the
parameter $\Delta^2/\sigma^2$. Without the knowledge of $\Delta^2/\sigma^2$, the hypothesis $\rho = 0$
cannot be tested using the above sampling distribution. The unconditional
distribution of $r$ may be obtained in the usual way. The large sample standard
error of the sample pms may be evaluated using formula (17) and a general
model for this universe. It is not presented here due to its complicated form.


Test of Hypothesis 


The usual analysis of variance test for testing the equality of group 
means (i.e., m, = mM, = +++ = m,) and Baritlett’s test for the equality of 
group variances (i.e., ¢j = 0; = -:- = o;) may also be carried out in this 
type of problem. But these tests will be conditional since n, , me. +++ , % 
are to be considered as fixed for given N. The unconditional test may be 
obtained by pooling the different conditional critical regions. 

One may also perform a Kruskal-Wallis one-way analysis of variance 
test ({10], pp. 184-194) for the equality of group means (i.e., p = 0). It will 
involve more computational labor but it can be applied in any general case 
without the normality assumption. 
BOOK REVIEWS 


F. Attneave. Applications of Information Theory to Psychology: A Summary of Basic Con- 
cepts, Methods, and Results. New York: Henry Holt and Company, 1959. Pp. vii + 120. 


Information theory has had three kinds of effects on psychology. First, it has suggested 
new facts that it would be nice to know about people, e.g., how much information can a 
man handle in making judgments or passing on messages? Human engineers have been very 
amenable to such suggestions. Second, it has suggested that the informational content of a 
stimulus might be a quantity nicely related to human response, such as reaction time, 
retention and so on, since, prima facie, the human organism seems to be built to handle 
information. Third, and last, it has suggested new ways to analyze the tests we make and 
the experiments we do, since, as has been realized of late, the mathematical methods of 
information theory have great generality and apply far beyond the limited communication 
situation which the methods were developed to handle. For example, it has been recognized 
that certain of the measures of information theory are measures of correlation; Kullback 
has recently published a work devoted to the situation of information theory within prob- 
ability theory and statistics. A measure of the adequacy of an introductory book on infor- 
mation theory is how well it introduces these three topics. 

The volume under review starts with a brief chapter on how quantitative variables 
such as information, redundancy, and minimum codes are related to our rough ideas about 
such things. The second chapter is devoted to stochastic processes, and especially to the 
question of sequential dependencies. Methods are described for determining the predict- 
ability of English text and thus the informational load carried by individual letters in 
typical passages of English. Mention is made, too, of applications of the same methods of 
analysis to animal learning data. The third chapter is the heaviest in the book and deals, 
first, with Garner and Hake’s and McGill’s development of multivariate information analy- 
sis, which is treated in very considerable detail, and, second, with the application of such 
methods of analysis to the question of man’s ability to transmit information. The fourth 
chapter is a very brief discussion of “a new approach to some old problems” which the 
concepts of information theory are providing. The main idea is that such notions as organ- 
ization or patterning, figural goodness, and amenability to schematization and encoding 
may be made more precise and attacked anew under the banner of information theory. 

The treatment throughout is easy and informal, even chatty, and, for the most part, 
lucid. However, the book suffers from a lack of care, as, for example, when the author 
says (p. 2) that six questions are always necessary and sufficient to locate one particular 
square out of sixty-four; some things are glossed over too lightly. The idea of redundancy 
is not adequately explained, for one thing. For another, information measures are presented 
as if objective probabilities were the correct values from which to compute them. H, the 
basic information measure, is said (p. 8) to be a “property of a set of events.” Now, one 
question that is particularly critical in psychology is whether objective probabilities should 
be used or whether, instead, one should use the expectations of some subject exposed to 
the events. The question is not given the attention it deserves. 

A more serious defect of the book is that it does not go to enough pains to disentangle 
the separate strands in the work done with information theory in the field of psychology. 
True enough, the author does point out in the introduction that information theory has 
value both in the formulation of certain psychological problems and in the analysis of 
certain psychological data, and he does say that “the Garner-Hake-McGill method of in- 
formational analysis is as completely neutral with respect to psychological schools and 
controversies as is analysis of variance” (p. 81), but the organization of the book does not 
encourage the reader to take these points as seriously as they need to be. Questions of 
determining man’s ability to transmit information, questions of measuring information in 
stimuli and doing experiments to determine the relations between the variables of informa- 
tion and performance, and questions of the analytic possibilities of information theory 
measures are all treated together in a way which confounds them in the reader’s mind. 
Although there have not been many, there have been some attempts to apply infor- 
mation theory to testing problems, but these are not mentioned in the book. Altogether 
the book is a disappointing one from a psychometrician’s point of view. There is, however, 
a short four-page appendix on information measures and variance statistics, which brings 
out the relationship between T(x;y), transmitted information, and r, the correlation 
coefficient, 

$$T(x;y) = \log \frac{1}{\sqrt{1 - r^2}},$$

where F(x,y) is bivariate normal. Furthermore, psychometricians could well profit from a 
careful study of the section on multivariate information analysis, in Chapter 3, if they are 
prepared to consider the implications that there are for psychometric problems. 
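
The relationship between transmitted information and the correlation coefficient is easy to tabulate. For a bivariate normal distribution the standard result is T = -(1/2) log(1 - r^2); the sketch below assumes base-2 logarithms, so that T is in bits, and the function name `transmitted_information_bits` is hypothetical rather than anything used in the book under review.

```python
import math


def transmitted_information_bits(r):
    """Transmitted information, in bits, between two jointly normal
    variables with correlation r: T = 0.5 * log2(1 / (1 - r**2))."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log2(1.0 / (1.0 - r * r))


# Tabulate T for a few illustrative correlations.
for r in (0.0, 0.3, 0.5, 0.9):
    t_bits = transmitted_information_bits(r)
    print(f"r = {r:.1f}   T = {t_bits:.3f} bits")
```

The table makes the reviewer's point concrete: T depends on r only through r squared, is zero when r is zero, and grows without bound as r approaches plus or minus one.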

The book, unfortunately, will do little to dispel the widespread and ill-founded im- 
pression that information theory is the special property in psychology of a group interested 
in problems of perception, memory, and man-machine systems. It is insufficiently broad and 
critical in its approach and so does not give the reader enough to get a proper bearing on 
a complex and quite confusing subject. 


Educational Testing Service John Ross 


Claire Selltiz, Marie Jahoda, M. Deutsch, and S. Cook. Research Methods in Social 
Relations (Revised in one volume). New York: Holt-Dryden, 1959. Pp. xvi + 662. 


This is a good, solid book. It covers the neglected area of design and strategy of social 
research, as differentiated from the design and statistical treatment of experiments. Statis- 
tics are occasionally mentioned but only in passing. The emphasis is on the time interval 
in the research process from the formulation of the problem to the analysis of the data and 
application of the results. The topics covered include selection and formulation of a research 
problem, research design, general problems of measurement, data collection, available data 
as source material, scaling, analysis and interpretation, the research report, application of 
social research, and research and theory. 

One of the authors’ major objectives is to achieve a meaningful union of social phe- 
nomena, and statistical and methodological sophistication. The flavor of this union is given 
in a mildly ingenious example of the authors’ description of type I and type II errors: 


“The decision as to just how the balance between the two kinds of error 
should be struck must be made by the investigator. . . . In many countries it is 
considered more important to reject a hypothesis of guilt when it is false than 
to fail to accept this hypothesis when it is true; a person is considered not 
guilty so long as there is reasonable doubt as to his guilt. In other countries, 
the acceptance of a false hypothesis of guilt is deemed less costly than the 
rejection of this hypothesis if it were true; a person charged with a crime is 
considered guilty until he has demonstrated his lack of guilt” (p. 418). 


The style of the book is strikingly improved over the first edition. Whereas that 
edition, in two volumes, was marked by serious unevenness, this consolidated version has 
been vastly rewritten and smoothed so that it flows as from a single author. The level is 
now uniformly elementary. This book would seem to be one that could be used in a beginning 
methodology course or in a course for students in other fields such as education or business 
administration. It emphasizes the general context and orientation to research, presents a 
general discussion of a wide variety of techniques for use in problems of social relations, 
and provides a good introduction and background for doing research in social relations. 
In actually carrying out a project, however, one would have to supplement it with the 
references and other more detailed presentations. There are no technical discussions of any 
magnitude, but rather typically a general statement and description of the issue or approach 
is presented and then reference is made to a primary source. 

In fact, if there is a drawback to this volume it is manifested as an occasional mushy 
feeling which it gives. Some discussions seem too general and vague, especially when they 
attempt to explain in literary terms what has been expressed succinctly and clearly in more 
precise language. These are matters of personal preference, of course, but for this reviewer 
some examples of these blunted discussions are: (1) a description of validity without mention 
of approaches such as Cronbach and Gleser’s to risk functions (p. 166); (2) discussion of 
content analysis without very specific comments especially on methodology (p. 406); (3) dis- 
cussion of causality without using Lazarsfeld’s diagrams which often simplify and clarify 
the issue (p. 424); (4) discussion of generalization without mentioning generalization to 
situations as well as to people (p. 416); (5) vague and questionable criticism of Guttman 
scaling (p. 376); and (6) omission of discussions by Kaplan and by Lazarsfeld on definition, 
specification of meaning, and indicators. 

There are several points of refreshing insight into the scientist at work. One apt 
example (p. 442): “The purpose of a report is not communication with oneself but com- 
munication with the audience. ... All too many (research documents) bear the stamp of a 
struggle for clarification of the author’s own thoughts.” How uncomfortably true! 

One salutary influence this book could have is to suggest a relatively different type 
of methodology course in the behavioral sciences, one which covers the period from ex- 
perience to experiment. Too often our students are well armed with statistical acumen but 
have little conception of how to carry an idea they have to the stage where such acumen 
is usable. For filling this gap, Research Methods in Social Relations has made an important 
contribution. 

William C. Schutz 


University of California at Berkeley 


Virginia L. Senders. Measurement and Statistics: A Basic Text Emphasizing Behavioral 
Science Applications. New York: Oxford University Press, 1958. Pp. xvi + 594. $6.00. 


This text is a bold undertaking: “The organization of the book is unusual. The various 
statistical measures are not taken up in the conventional order but in an order determined 
by the scale of measurement with which their use first becomes appropriate’’—where 
Senders uses S. S. Stevens’ particular taxonomy for scales of measurement. I shall bow to 
this organization and review the book chapter by chapter. 

Chapter 0, “Techniques Necessary for Elementary Statistics,” seems too long and 
involved; but worse, the suggestion that the elementary mathematical techniques presented 
are “necessary” rather than “desirable” is an ill omen for what follows in the book: often 
too dogmatic a prescription of what one must learn. 

The overviewing Chapter 1 is provocative and interesting, but marred by an insipidly 
exuberant (or perhaps contrived) style which makes one embarrassed for the author. On 
page 44 Senders stumbles twice, making probability statements about parameters. That 
the mishandling of this topic is no fluke is shown repeatedly in later chapters concerned 
with statistical inference. 
Chapter 2, a careful, accurate, and obviously devoted review of Stevens’s classic 
chapter, ‘‘Mathematics, Measurement, and Psychophysics,” from the Handbook of Experi- 
mental Psychology, is perhaps the most important chapter in the book. For it provides the 
rationalization for the unusual organization of the text and for the title of the book. (But 
make no mistake. This is a statistics book not one on measurement. Beyond repeating the 
shibboleths of Stevens it does not treat psychological measurement.) Chapter 2 repeatedly 
exhibits the philosophically naive faith that there “exists” an “actual” or “true” scale 
for a particular phenomenon; the author seems to assume a degree of absolute truth in- 
herent in nature which went out of style in the nineteenth century. How one may ascribe 
meaning to the notion of determining the “actual” scale remains remarkably unclear. 
But apparently we must, for Senders warns us no less than eleven times in Chapter 2 
that unless we are able to do this, our subsequent statistics are sure to be “wrong.” 

Chapters 3 through 8 spend a long, long time on descriptive statistics. The material 
is often delightfully presented (see the seduced secretaries on page 106); it is a pleasure to 
see the author’s enthusiastic intelligence at work. But the writing seems extremely sloppy. 
Swarms of little errors irritated this reader; perhaps the worst is the continual inappropriate 
use of the adverb “mathematically” as a completely irrelevant justification for obeying 
Stevens’ strictures. Whether many of the intrinsically difficult details of descriptive 
statistics are worth the student’s time is also questionable. Many teachers of the course 
would argue that students should save their intellectual energy for studying statistical 
inference. 

Chapter 9 reviews descriptive statistics with respect to Stevens’ scales of measure- 
ment in a series of well-organized tables. 

For a book at the introductory level, I found Chapter 10, “Probability,” quite good. 
Here, Senders’ usually delightful style comes through with few inaccuracies and the chapter 
presents a seemingly appropriate treatment of elementary considerations in probability. 
(Only twice does Senders make probability statements about parameters.) 

Chapter 11—second only in importance to Chapter 2—presents an overview of 
statistical inference, “Testing Hypotheses and Establishing Margins of Error.” It is 
distinguished by an unusual density of sloppy thinking and careless writing. Starting from 
an anachronistic and inadequate definition of the null hypothesis, “... any hypothesis 
stated in such a way that the data are given a chance to disprove it” (page 357), Senders 
gives examples which, at best, are confused and confusing, and at worst, meaningless. 
Example: “The difference in the amount of learning achieved by the two methods is not 
greater than would be expected by chance alone” (page 358). The definition of level of 
significance seems unusual, but appears correct; however, the definition of level of confidence 
as one minus the level of significance indicates a lack of familiarity with modern statistical 
terminology. In this general chapter on statistical inference, Senders makes meaningless 
probability statements about hypotheses. Her figures illustrating power functions (pages 
372 and 378)—concocted by computing a few points and connecting them with straight 
lines—do not enhance one’s respect for the book. This chapter also provides an introduction 
to estimation. The adjective “best” is used by Senders to mean unbiased. In view of the 
meaningless probability statements in hypothesis testing, it is little wonder that the book 
badly mishandles the probabilistic aspects of interval estimation. Example: “The prob- 
ability that the population mean lies between 90 and 110 is .95” (page 381). 

After 389 pages, we are finally ready to embark on a detailed consideration of the 
elements of statistical inference—and we have only 136 pages in which to do it. The organi- 
zation here is as before; the various procedures are taken up in an order determined by 
Stevens’ scales. These chapters are not so debilitated by an abundance of errors as is the 
introductory inferential chapter just reviewed; however, errors still appear. Most are 
things that Senders could correct by writing more carefully and by paying closer attention 
to the modern theory of statistical inference. 
And then it’s all over. After 525 enthusiastically written pages replete with error we 
have arrived almost nowhere: only the simplest standard tests have been considered, and 
the analysis of variance has just barely been touched upon. The effort required to read 
this book far exceeds its possible value. 

More generally this book is pretentious in scope. Its aim is ambitious: a simultaneous 
consideration of both measurement and statistics. In this aim, Senders—like Stevens 
before her—has made the great error of failing explicitly to make the distinction between 
a statistical hypothesis (a statement about the probability distribution of an observable 
random variable) and a purportedly corresponding scientific hypothesis—a distinction 
which is crucial in considering the relevance of scales of measurement to statistical inference. 
For it is clearly a matter of fact that assumptions about scales of measurement are irrelevant 
to statistical hypotheses. Are such assumptions relevant in establishing an isomorphism 
between a statistical hypothesis and a scientific hypothesis? This is presently a highly 
controversial question, and most assuredly not to be presented as settled. That Senders 
ignores the controversy and thus does not recognize the depth of the water in which she 
swims makes her seem more like a drowning zealot than a careful scholar. 

Her position on this matter leads her to continual negativistic prescriptions of what 
is “illegal” in statistics. Thus the book provides grist for the mill of those pseudo-sophisti- 
cated psychological statisticians whose main concern in life seems to be to bludgeon honest 
substantive-minded folk with imprecations to meet unimportant or irrelevant so-called 
statistical assumptions. 

Consequently, it is hard to appreciate the organization of this text. Unlike the one 
earlier book using this organization (Siegel’s Nonparametric Statistics), the present book’s 
consideration of scales of measurement seems to muddy the treatment of statistical prob- 
lems. (Of course, Siegel was wise explicitly to bring up scales of measurement because it 
enhanced his case for a stampede to nonparametric methods.) 

All in all, we have a book often delightful to read because of the obvious intelligence, 
the pleasant pawky humor, and the enthusiastic enterprise of the author. On the other 
hand, we have a carelessly written book with a density of errors that seems inconceivable, 
confounded by a naive devotion to Stevens’ scales of measurement, and apparently written 
in relatively thoroughgoing ignorance of modern statistical theory. 


Henry F. Kaiser 
University of Illinois 
Minutes of the 


1960 ANNUAL BUSINESS MEETING 


of the 


PSYCHOMETRIC SOCIETY 


The regular Annual Meeting of the Psychometric Society was held in Chicago, 
Illinois, on Tuesday, September 6, 1960. President Lloyd G. Humphreys called the 
meeting to order at 5:00 P.M. 


The minutes of the previous Annual Meeting were approved. 


On a ballot for the election of two new members of the Council of Directors, 
Dr. Allen L. Edwards and Dr. Bert F. Green were elected for a term of three years, 
ending in 1963. 


Dr. William B. Michael reported for the Membership Committee. The Mem- 
bership Committee nominated 37 persons as full members, 9 student members to be 
transferred to full membership, and 25 individuals as student members. 


It was moved, seconded, and passed that the following 37 persons be elected 
as full members. 

J. E. Alman 
Daniel J. Baer 
Rolf Bargmann 
Francis J. Blaisdell 
Gordon H. Bower 
John A. Creager 
Robert Benjamin Davis 
Robert W. Earl 
Calvin F. Esselbruegge 
Jerome Louis Fine 
Frank Garfunkel 
Charles E. Hall 
David O. Herman 
Howard F. Hjelm 
John A. Hornaday 
E. J. Hovorka 
Earl B. Hunt 
Richard M. Johnson 
Barry Keating 
Thomas R. Knapp 
Kurt S. Konigsbacher 
Paul R. Lohnes 
Juan Aranda Lopez 
Albert Joseph Macek 
Arthur C. MacKinney 
Alvin Marks 
Arthur Mittman 
H. William Morrison 
Charles H. Proctor 
Robert Radlow 
Donald Allen Schumsky 
Harry Smith, Jr. 
Fritz Stillwold 
Norman L. Vincent 
Leopold O. Walder 
James Wilson Walker 
Leroy Wolins 


It was moved, seconded, and passed that the 9 student members listed below be 
transferred to full membership. 

Carl Bereiter 
Bruce Douglas Faulds 
Hiroshi Ikeda 
George G. Karas 
John E. Overall 
Erich P. Prien 
Douglas K. Spiegel 
Francis R. Watson 
Murray Weiner 


It was moved, seconded, and passed that the 25 persons named below be elected 
as student members. 

James H. Beaird 
James C. Becknell, Jr. 
Richard J. Campbell 
John M. Devine 
Franklin L. Duff 
Ruth B. Ekstrom 
Arthur S. Gershon 
John V. Haley 
Samuel Hung 
Lorne M. Kendall 
Jim Klingsporn 
John Marshall Long 
Elizabeth Lynn 
Ronald A. Marks 
Victor E. McGee 
Barbara Pitcher 
Earle Wesley Richardson 
John Ross 
Fumiko Samejima 
Meyer Starr 
Robert Stephen Tacker 
Francis James Vingoe 
Richard V. Wagner 
Lonnie Dean Whitehead, Jr. 
Raymond A. Wiesen 
It was moved, seconded, and passed that the Membership Committee be 
thanked for their excellent work. 


It was announced that Dr. John B. Carroll had been elected President of 
the Society for the term ending September 30, 1961. 


Dr. Robert P. Abelson reported for the Program Committee. Of 15 abstracts 
of papers submitted for consideration for presentation at the Annual Meeting, 12 were 
accepted. Two symposia were scheduled. A proposal for a symposium to be pre- 
sented jointly with Division 8 of the APA turned out to be impossible to schedule. 

It was moved and seconded that the report be accepted with thanks. Motion passed. 


Dr. Charles K. Wrigley reported for the Committee on the 25th Anniversary 
of the Psychometric Society. Four events were scheduled by this Committee: two 
symposia, "Psychophysics: One Hundred Years After," with Dr. Warren S. 
Torgerson as Chairman, and "New Developments in Mathematical Psychology," 
with Dr. Clyde H. Coombs as Chairman; an invited address by Dr. J. P. Guilford, 
"Psychometric Progress: A Review of the Past Twenty-Five Years"; and a 
luncheon with founders, past presidents, current officers, and Mrs. L. L. 
Thurstone and Mrs. Karl J. Holzinger as guests, and with Dr. Jack W. Dunlap 
as the invited speaker. For its work the Committee had available $500 from the 
Psychometric Society, $1000 from the Psychometric Corporation, and $701.30 as 
contributions from members. Travel expenses, luncheon expenses, and the cost of 
printing the program will amount to approximately $1200, with $1000 available for 
printing the papers in Psychometrika. 


It was moved and seconded that the Society express its great appreciation to 
Dr. Wrigley and the other members of his Committee, Dr. Paul L. Dressel and Dr. 
John E. Milholland, for their excellent work in arranging the Anniversary Program. 
Motion passed unanimously. 


Dr. William B. Schrader reported for the Committee on Relations Between 
the Psychometric Society and the Psychometric Corporation. Drafts of two docu- 
ments have been prepared: a Certificate of Corporation of the Psychometric Society 
(equivalent to a constitution) and By-Laws of the Psychometric Society. In accord- 
ance with the present Constitution these drafts have been approved by a three-fourths 
vote of the entire membership of the Council of Directors and the Editorial Council as 
a whole, and will be submitted to the membership of the Society for a mail ballot. A 
two-thirds vote of all members responding will be required for the drafts to become 
effective. It is expected that copies of the two documents, together with an approp- 
riate ballot will be in the mail to all members in a few weeks. 


Dr. Ledyard R Tucker reported for the Auditing Committee. He stated that 
all financial matters of the Society were found to be in good order. The report was 
accepted with thanks. 


On motion Dr. John W. French was elected as Treasurer for a term of three 
years beginning October 1, 1960. 


The report of the Treasurer was presented by Dr. William B. Schrader. A 
copy is attached. The report was accepted with thanks. 


The matter of abstracting or indexing computer programs of interest to 
psychologists using quantitative methods was referred to the Editorial Council of 
Psychometrika. 


The meeting was adjourned at 5:45 P.M. 


Philip H. DuBois 
Secretary 
RECEIPTS 

Dues 

    Year     Members        Student Members 
    1961     [illegible]    - 
    1960     584            42 
    1959     52             15 
    1958     5              - 
    1957     [illegible]    [illegible] 
    Total    644            57                           $4,736.00 

Contributions to 25th Anniversary Fund 

    Amount     Number 
    $10.00     [not preserved] 
      9.80     [not preserved] 
      5.00     [not preserved] 
      3.00     [not preserved] 
                                                            701.30 

Received with Dues for Corporation Publications             122.60 
Net overpayments                                              0.74 
Partial Payments                                              5.00 
Account Not Identified                                        7.00 

Total Receipts                                           $5,572.64 

DISBURSEMENTS 

Psychometric Corporation (90% of dues)                   $4,266.90 
Psychometric Corporation (Publications)                     122.60 
Stationery and Postage                                      128.95 
Secretarial Services                                        114.37 
Psychometric Corporation (Overpayment)                        8.25 
Refund                                                        7.00 
25th Anniversary Fund                                        20.21 

Total Disbursements                                      $4,668.28 

BALANCE 

Balance, June 30, 1959                                   $1,197.20 
Receipts, 1959-60                                         5,572.64 
                                                         $6,769.84 
Disbursements, 1959-60                                    4,668.28 
Balance, June 30, 1960                                   $2,101.56 

(The above report is that accepted by the membership at the 
Annual Meeting of the Psychometric Society. It was determined 
subsequently that the following figures are correct: Disburse- 
ments, 25th Anniversary Fund, $29.21; Total Disbursements, 
1959-60, $4,677.28; and Balance, June 30, 1960, $2,092.56.) 
PSYCHOMETRIC CORPORATION 


Statement of Receipts and Disbursements for Fiscal Year 
Ended June 30, 1960 

RECEIPTS 
Subscriptions (less agency discounts) $ 7,292.60 
Psychometric Society (90% of dues) 4,266.90 
Sales of Back Issues (less discounts) 2,728.15 
Sale of Monographs 5-8 159.50 
Interest on Savings Accounts 351.78 
Reprints 380.50 
Net overpayments 4.60 
Psychometric Society (Overpayment) 8.25 
Refund by William Byrd Press 70.90 
Miscellaneous 2.32 

$15,265.50 

DISBURSEMENTS 
Printing and Mailing Psychometrika 

Volume 24, No. 2 through 25, No. 1 8,375.64 
Reprints 442.17 
Stipend of Managing Editor 

(7/1/59 - 6/30/60) 1,312.50 
Stipend of Assistant Editor 

(7/1/59 - 6/30/60) 687.50 
Stipend of Treasurer 

(7/1/59 - 6/30/60) 437.50 
Secretarial Services: Editorial Office 800.00 
Secretarial Services: Business Office 242.80 
Stationery and Postage 208.95 
Mailing Back Issues and Monographs 175.03 
Psychometric Society (Incorrect credit) 37.80 
Refunds 22.05 
Deposited: Metropolitan Savings and Loan Assn. 

Los Angeles, California 5,000.00 
Binding Charges 54.05 
Miscellaneous 4.95 

$17,800.94 
BALANCE AND RESERVES 
Balance, June 30, 1959 $10,785.56 
Reserve Funds, June 30, 1959 

Englewood Savings and Loan Assn. 

Englewood, Colorado 3,500.00 

Metropolitan Savings and Loan Assn. 

Los Angeles, California 3,500.00 
Total $17,785.56 
Receipts, 1959-60 15,265.50 

Sum 33,051.06 
Disbursements 17,800.94 

Remainder $15,250.12 
Balance, June 30, 1960 $ 8,249.82 
Reserve Funds, June 30, 1960 

Englewood Savings and Loan Assn. 

Englewood, Colorado 3,500.00 

Metropolitan Savings and Loan Assn. 

Los Angeles, California 8,500.00 
Total, Balance and Reserve Funds $20,249.82 

OBLIGATIONS 
Estimated cost of Psychometrika, 

Volume 25, Nos. ot 5 

Printing and Mailing $ 6,300.00 
Stipends (7/1/60 - 12/31/60) 1,375.00 
Secretarial Services 550.00 

$ 8,225.00 
BALANCE AND RESERVES, LESS OBLIGATIONS $12,024.82 
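
The column totals in the statement above can be re-added mechanically. The following is only an illustrative sketch: the helper `total` and the variable names are hypothetical, the amounts are transcribed from the statement, and exact decimal arithmetic is used so that cents are not disturbed by binary rounding.

```python
from decimal import Decimal


def total(amounts):
    """Add dollar amounts given as strings, using exact decimal arithmetic."""
    return sum((Decimal(a) for a in amounts), Decimal("0"))


# Line items transcribed from the Corporation statement.
receipts = total([
    "7292.60", "4266.90", "2728.15", "159.50", "351.78",
    "380.50", "4.60", "8.25", "70.90", "2.32",
])
disbursements = total([
    "8375.64", "442.17", "1312.50", "687.50", "437.50", "800.00",
    "242.80", "208.95", "175.03", "37.80", "22.05", "5000.00",
    "54.05", "4.95",
])
opening = total(["10785.56", "3500.00", "3500.00"])   # balance plus reserves, June 30, 1959
obligations = total(["6300.00", "1375.00", "550.00"])
closing = total(["8249.82", "3500.00", "8500.00"])    # balance plus reserves, June 30, 1960

remainder = opening + receipts - disbursements
net_of_obligations = closing - obligations

print("Total receipts:             ", receipts)
print("Total disbursements:        ", disbursements)
print("Remainder:                  ", remainder)
print("Obligations:                ", obligations)
print("Balance less obligations:   ", net_of_obligations)
```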